This summary discusses all 5 parts of Information Storage
and Retrieval (ISR-18), which is available in its entirety as
LI 002 719. Only the papers from Part One are reproduced here
as LI 002 720. See LI 002 721 thru LI 002 724 for Parts 2 - 5.

Summary 

The present report is the eighteenth in a series describing research
in automatic information storage and retrieval conducted by the Department 
of Computer Science at Cornell University. The report covering work carried 
out by the SMART project for approximately one year (summer 1969 to summer 
1970) is separated into five parts: automatic content analysis (Sections
I to IV), automatic dictionary construction (Sections V to VII), user
feedback procedures (Sections VIII to XI), document and query clustering
methods (Sections XII and XIII), and SMART systems design for on-line
operations (Sections XIV and XV).

Most recipients of SMART project reports will experience a gap in 
the series of scientific reports received to date. Report ISR-17, consisting 
of a master's thesis by Thomas Brauen entitled "Document Vector Modification 
in On-line Information Retrieval Systems" was prepared for limited distribu- 
tion during the fall of 1969. Report ISR-17 is available from the National 
Technical Information Service in Springfield, Virginia 22151, under order 
number PB 186-135. 

The SMART system continues to operate in a batch processing mode 
on the IBM 360 model 65 system at Cornell University. The standard processing
mode is eventually to be replaced by an on-line system using time-shared
console devices for input and output. The overall design for such an on-line
version of SMART has been completed, and is described in Section XIV of the
present report. While awaiting the time-sharing implementation of the
system, new retrieval experiments have been performed using larger document
collections within the existing system. Attempts to compare the performance
of several collections of different sizes must take into account the
collection "generality". A study of this problem is made in Section II of 
the present report. Of special interest may also be the new procedures 
for the automatic recognition of "common" words in English texts (Section
VI), and the automatic construction of thesauruses and dictionaries for use
in an automatic language analysis system (Section VII). Finally, a new
inexpensive method of document classification and term grouping is 
described and evaluated in Section XII of the present report. 

Sections I to IV cover experiments in automatic content analysis 
and automatic indexing. Section I by S. F. Weiss contains the results of
experiments using statistical and syntactic procedures for the automatic
recognition of phrases in written texts. It is shown once again that
because of the relative heterogeneity of most document collections, and
the sparseness of the document space, phrases are not normally needed 
for content identification. 

In Section II by G. Salton, the "generality" problem is examined 
which arises when two or more distinct collections are compared in a 
retrieval environment. It is shown that proportionately fewer nonrelevant 
items tend to be retrieved when larger collections (of low generality) 
are used, than when small, high generality collections serve for evaluation 
purposes. The systems viewpoint thus normally favors the larger, low 
generality output, whereas the user viewpoint prefers the performance of 
the smaller collection. 

The effectiveness of bibliographic citations for content analysis 
purposes is examined in Section III by G. Salton. It is shown that in 
some situations when the citation space is reasonably dense, the use of
citations attached to documents is even more effective than the use of
standard keywords or descriptors. In any case, citations should be added
to the normal descriptors whenever they happen to be available.

In the last section of Part 1, certain template analysis methods 
are applied to the automatic resolution of ambiguous constructions 
(Section IV by S. F. Weiss). It is shown that a set of contextual rules
can be constructed by a semi-automatic learning process, which will eventually
lead to an automatic recognition of over ninety percent of the existing
textual ambiguities.

Part 2, consisting of Sections V, VI and VII, covers procedures
for the automatic construction of dictionaries and thesauruses useful in
text analysis systems. In Section V by D. Bergmark it is shown that word
stem methods using large common word lists are more effective in an
information retrieval environment than some manually constructed thesauruses,
even though the latter also include synonym recognition facilities.

A new model for the automatic determination of "common" words
(which are not to be used for content identification) is proposed and
evaluated in Section VI by K. Bonwit and J. Aste-Tonsmann. The resulting
process can be incorporated into fully automatic dictionary construction
systems. The complete thesaurus construction problem is reviewed in Section
VII by G. Salton, and the effectiveness of a variety of automatic dictionaries
is evaluated.

Part 3, consisting of Sections VIII through XI, deals with a 
number of refinements of the normal relevance feedback process which has 
been examined in a number of previous reports in this series. In Section
VIII by T. P. Baker, a query splitting process is evaluated in which input
queries are split into two or more parts during feedback whenever the
relevant documents identified by the user are separated by one or more
nonrelevant ones.

The effectiveness of relevance feedback techniques in an environ- 
ment of variable generality is examined in Section IX by B. Capps and M. 
Yin. It is shown that some of the feedback techniques are equally
applicable to collections of small and large generality. Techniques of negative
feedback (when no relevant items are identified by the users, but only 
nonrelevant ones) are considered in Section X by M. Kerchner. It is shown 
that a number of selective negative techniques, in which only certain 
specific concepts are actually modified during the feedback process, bring 
good improvements in retrieval effectiveness over the standard nonselective 
methods.

Finally, a new feedback methodology in which a number of documents 
jointly identified as relevant to earlier queries are used as a set for
relevance feedback purposes is proposed and evaluated in Section XI by L. 
Paavola. 

Two new clustering techniques are examined in Part 4 of this report,
consisting of Sections XII and XIII. A controlled, inexpensive, single-pass
clustering algorithm is described and evaluated in Section XII by D. B.
Johnson and J. M. Lafuente. In this clustering method, each document is
examined only once, and the procedure is shown to be equivalent in certain
circumstances to other more demanding clustering procedures.

The query clustering process, in which query groups are used to 
define the information search strategy, is studied in Section XIII by S.
Worona. A variety of parameter values is evaluated in a retrieval
environment to be used for cluster generation, centroid definition, and final
search strategy.

The last part, number five, consisting of Sections XIV and XV,
covers the design of on-line information retrieval systems. A new 
SMART system design for on-line use is proposed in Section XIV by D. and 
R. Williamson, based on the concepts of pseudo-batching and the interaction 
of a cycling program with a console monitor. The user interface and 
conversational facilities are also described. 

A template analysis technique is used in Section XV by S. F. Weiss
for the implementation of conversational retrieval systems used in a time- 
sharing environment. The effectiveness of the method is discussed, as 
well as its implementation in a retrieval situation. 

Additional automatic content analysis and search procedures used
with the SMART system are described in several previous reports in this 
series, including notably reports ISR-11 to ISR-16 published between 1966 
and 1969. These reports are all available from the National Technical 
Information Service in Springfield, Virginia. 



G. Salton 
I. Content Analysis in Information Retrieval
S. F. Weiss



Abstract 

In information retrieval there exist a number of content analysis 
schemes which analyze natural language text to varying degrees of complexity. 
Regardless of how well the text analysis is performed by each process, 
the true value of a given process lies in its effectiveness as an information 
retrieval tool. The performance may in each case be investigated by 
actual retrieval tests using the various proposed content analysis schemes. 

Results obtained with a variety of linguistic phrase recognition 
methods show that very little, if any, improvement in retrieval effectiveness
is obtained when any of the refined content analysis schemes are used
with existing document collections. The main reason appears to be the fact 
that the value of refined content analysis systems resides in their 
effectiveness in separating lexically similar, but semantically different 
documents. Existing collections are too sparse, and do not contain many
close documents. When denser collections are created, it can be shown that 
linguistic content analysis methods become of increasing value as the density 
increases. The queries also influence the type of content analysis to be 
used. In general, queries of the question-answering variety show improved 
retrieval results with increasing refinements in the content analysis. 

Document retrieval queries do not exhibit this type of improvement. 

Future work must be devoted to a determination of what makes a user 
judge a particular document to be relevant. With more insight into the 
relevance area, the role of linguistic content analysis in information 
retrieval may become more clearly defined. 

1. Introduction

A content analysis system, as considered in this study, serves as an
information retrieval aid. It is therefore necessary to perform retrieval
using various content analysis methods to determine how well each
fulfills this role. This study presents experiments and results
aimed at determining the conditions under which content analysis improves 
retrieval results as well as the degree of improvement obtained. All 
information retrieval systems use some degree of content analysis in its 
broadest sense. This is generally in the form of assignment of concept 
indicators to individual words. But in this study content analysis refers 
to the analysis and utilization of multi-word groups as information 
retrieval tools. 

Using phrases determined by content analysis as an information
retrieval aid is theoretically very appealing; it adds another dimension
to search capabilities beyond the single word matching used by most
information retrieval systems. Documents and queries are matched not
only on content, but on the interrelationship of content elements as well.
Hutchins [3] has proposed an information retrieval system based solely
on the cooccurrence of phrases in documents and queries. However, some
experiments indicate that phrases alone may be too strict a criterion
for useful results. A more reasonable approach is to use phrases in
conjunction with a less structured method such as word or concept matching.
Therefore in this study phrases are considered as an adjunct to single concept
matching.

A number of existing information retrieval systems permit 
searching on multi-word structured information. Some systems such as that 

and queries by contiguous word pairs as well as individual words. Retrieval
is thus aided by this rudimentary form of phrase analysis. The IBM
Document Processing System [4] takes this capability one step further.
Multi-word search keys can be specified using a number of options besides
simple contiguity. For example, consider the sample queries below. Query A
retrieves documents containing "information" and "retrieval" in that order
and separated by at most one other word. Query B retrieves documents
with the same two words separated by at most one word but with no restriction
on ordering. This will retrieve "information retrieval" as well as
"retrieval of information". Queries C and D further relax the proximity
criterion and retrieve documents in which "information" and "retrieval"
occur within the same sentence and the same paragraph respectively.

A. INFORMATION RETRIEVAL (+1)

B. INFORMATION RETRIEVAL (±1)

C. INFORMATION RETRIEVAL (SEN)

D. INFORMATION RETRIEVAL (PAR)

This specification is an attempt to perform some degree of semantic
normalization. It permits the association of phrases which are semantically
similar but structurally different. However, the IBM system and others like
it approach the semantic normalization by structural rather than semantic
means. The resultant semantic processes are hence necessarily very
superficial. As Lesk points out, phrases determined by processes of this
type may cooccur in documents and queries too infrequently for them to be
of any practical value. Lesk therefore proposes an information retrieval
system in which documents and queries are subjected to a complex syntactic
and semantic analysis. Phrase normalization is then based on meaning rather
than just structure [5]. A few other semantically based content analysis
schemes exist, such as the manual indexing process developed by Mandersloot,
Douglas and Spicer [2]. Of all existing information retrieval systems with
content analysis capabilities, the SMART system provides the greatest
variety of content analysis methods. This makes SMART an excellent
experimental facility for testing content analysis in general. The various
SMART content analysis methods are presented in some detail later in this
study.
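The four proximity options of sample queries A to D can be illustrated with a short sketch. This is not the IBM system's implementation: the function name `matches`, the option strings, and the rules used here for tokenizing text and for splitting it into sentences and paragraphs are all assumptions made for the example.

```python
import re


def _tokens(text):
    """Lower-case alphabetic tokens of a text (an assumed tokenization)."""
    return re.findall(r"[a-z]+", text.lower())


def _near(tokens, first, second, max_gap, ordered):
    """True if the two words occur with at most max_gap words between them."""
    pos1 = [i for i, t in enumerate(tokens) if t == first]
    pos2 = [i for i, t in enumerate(tokens) if t == second]
    for i in pos1:
        for j in pos2:
            gap = abs(j - i) - 1
            if i != j and gap <= max_gap and (j > i or not ordered):
                return True
    return False


def matches(text, first, second, option):
    """Apply one of the four proximity options to a document text."""
    if option == "+1":    # query A: in order, at most one intervening word
        return _near(_tokens(text), first, second, 1, ordered=True)
    if option == "+-1":   # query B: either order, at most one intervening word
        return _near(_tokens(text), first, second, 1, ordered=False)
    if option == "SEN":   # query C: same sentence
        units = re.split(r"[.!?]", text)
    elif option == "PAR":  # query D: same paragraph (blank-line separated)
        units = re.split(r"\n\s*\n", text)
    else:
        raise ValueError(f"unknown option: {option}")
    return any(first in toks and second in toks
               for toks in map(_tokens, units))
```

Under these assumptions, "retrieval of information" fails option A but satisfies option B, which is the behavior described for the two queries in the text.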

In information retrieval, phrases can do two things. First, they 
can distinguish between two documents with similar content elements but 
different meaning. For example, the two inputs below are assigned identical 
concept vectors by normal text cracking methods. To distinguish between
them requires that the structure as well as the content of the input be
considered. 



A. Design of computer systems 

B. Computerized design systems 
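A toy version of word-level text cracking shows why inputs A and B collapse to a single concept vector. The stem table, the concept numbers, and the common-word list below are hypothetical and exist only for this illustration; they are not the SMART dictionaries.

```python
from collections import Counter

# Hypothetical common-word list and stem-to-concept table.
COMMON_WORDS = {"of", "the", "a"}
CONCEPTS = {"design": 101, "comput": 102, "system": 103}


def stem(word):
    """Return the first known stem that the word starts with, else None."""
    for s in CONCEPTS:
        if word.startswith(s):
            return s
    return None


def concept_vector(text):
    """Map a text to a bag of concept numbers, ignoring word order."""
    vec = Counter()
    for word in text.lower().split():
        if word in COMMON_WORDS:
            continue
        s = stem(word)
        if s is not None:
            vec[CONCEPTS[s]] += 1
    return vec


a = concept_vector("Design of computer systems")
b = concept_vector("Computerized design systems")
# Both inputs yield the same three concepts with the same weights.
```

Because word order and syntactic roles are discarded, any distinction between the two inputs has to come from structural information such as phrases.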



A second job performed by phrases is that of reinforcing correlations 
between queries and documents which have similar phrases. In this way the 
cooccurrence in the document and query of concepts which form a phrase is 
weighted more heavily than the cooccurrence of a similar number of unrelated
concepts. While this might appear to be a convincing case in favor of using
phrases in information retrieval, the previous argument is purely
theoretical. It remains to test the theory by performing retrieval using various
phrase determination methods. It is necessary to analyze the results
obtained not only to determine how the overall results compare with those
achieved without the use of phrases, but also to determine the exact cause
of the phrase method results. That is, are the new results a function of
the document or query collections used, the phrase determining technique,
the matching procedure, or a combination of several factors?



2. ADI Experiments 

The first set of experiments uses the ADI collection. This is
a set of eighty-two documents and thirty-five queries in the field of
documentation. About half of the queries ask for specific information while the
other half are of a more general nature. A set of ten queries, five general
and five specific, is chosen as representative of the various query forms
and constructions. A normal SMART retrieval run is then performed on the
entire ADI collection and the ten test queries. For each query the ten
most highly correlated documents are identified. These documents along
with any others, relevant to the test queries but not in the top ten, are
collected to form a test document set. The total set contains 56 of the 82
ADI documents. In all the experiments phrases are determined for this test
set only. It is felt that the results achieved with this limited set will
differ little from those of the full set. The use of a restricted set
such as this is also a practical necessity since the great quantity of hand
analysis required by these experiments precludes the use of the full
document and query sets. Figure 1 indicates the results of a normal cosine
retrieval process using the ten test queries. The following subsections 
discuss experimentation using various phrase determining techniques. 




[Figure 1. Standard SMART Results (No Phrases)]

A) Statistical Phrases

The statistical phrase process uses a predetermined list of phrases.
The occurrence of the phrase elements in a document or query is considered
an occurrence of that phrase regardless of the syntactic relation of the 
phrase components. A concept number is associated with a phrase and the 
appropriate concepts are appended to the document or query vectors. This 
method is clearly the simplest way to determine phrases since it requires 
no syntactic analysis of the text. However, statistical phrases have 
some serious drawbacks. Most obvious is the fact that they may recognize 
false phrases; that is, occurrences of the desired phrase elements but 
not in the proper syntactic relation. This problem can be minimized in 
small collections dealing with a narrow subject area by judicious selection 
of the statistical phrase list. In a corpus dealing with computer systems, 
for example, the occurrence of the words "real" and "time" can be viewed
with relative certainty to be an occurrence of the phrase "real time".
However, as the collection grows and the subject area broadens, these
decisions become less certain. Also the difficulty in creating the phrase 
list is increased as the corpus is enlarged. The phrase list can be 
determined by statistical means; however, weaknesses in this method can 
create problems. In the ADI collection for example, of the 409 statistical 
phrases in the test document set, only 153, roughly 37%, are syntactically
correct. Figure 2 shows the results achieved using statistical phrases 
along with the standard no-phrase results. The results for statistical 
phrases are slightly higher in places, lower in others and show no signifi- 
cant overall improvement in retrieval quality. 
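The statistical phrase process, and the false phrases it can produce, can be sketched as follows. The phrase list, the phrase concept number 9001, and the function name are hypothetical; the sketch works on raw words rather than SMART concept numbers to keep it short.

```python
# Hypothetical phrase list: set of component words -> phrase concept number.
PHRASE_LIST = {
    frozenset({"real", "time"}): 9001,
}


def statistical_phrases(text, phrase_list=PHRASE_LIST):
    """Return phrase concepts whose components all occur in the text.

    No syntactic check is made: the components may appear anywhere in
    the text and in any order.
    """
    words = set(text.lower().split())
    return sorted(num for components, num in phrase_list.items()
                  if components <= words)


true_phrase = statistical_phrases("a real time operating system")
false_phrase = statistical_phrases("the real cost in time of manual indexing")
# Both texts receive phrase concept 9001, although only the first
# actually contains the phrase "real time".
```

The second text shows how a phrase can be recognized even though its components are not in the proper syntactic relation.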

[Figure 2. Experiment 2: Statistical Phrases]

B) Syntactic Phrases

As mentioned previously, almost two-thirds of the statistical phrases
determined for the test set turn out to be syntactically incorrect. Removal
of the false phrases would allow the phrase component of the concept vector
to represent more closely the true structure of the document or query. An 
automated process to perform this would first locate statistical phrases and 
then, using some syntactic analysis technique, weed out the erroneous ones. 
The syntactic analysis process required here is considerably simpler than
general syntactic analysis since the process need only check the correctness
of a statistical phrase rather than perform a complete syntactic parse.
However, since the purpose of this study is to determine the value of
syntactic phrases as a retrieval aid and not to test a syntactic analyzer, 
the analyses are done by hand. Removal of false phrases leaves 153 of the 
original 409 document phrases and 6 of the 12 query phrases. Results of 
this process are presented in Figure 3, and are again disappointing.
Syntactic phrases show no significant improvement in retrieval performance.

C) Cooccurrence 

The easiest way to handle phrases, and the way used in the previous 
experiments, is simply to assign each phrase a concept number and append 
the number onto the appropriate concept vector. After assignment, phrase 
concepts become indistinguishable from single word concepts, and the 
correlation coefficient operates normally. Unfortunately this gives rise
to a number of serious problems. First is the dilution effect caused by
unmatched phrase concepts. The probability of a phrase match between a
document and query is quite small due to the added structural requirements
inherent in phrase matching. Furthermore, since documents are typically
much longer than queries, the document contains many phrases which cannot
possibly match the query. As a consequence many phrase concepts are not
matched. These unmatched concepts lower the correlation and partially if
not completely offset any gain achieved by matched phrases. Thus the
inclusion of too many phrases can dilute the vector with unusable information
and inferior results may be produced.



A second problem deals with the value of a phrase as a nonrelevancy
indicator. Individual word concepts are about equal as relevancy and
nonrelevancy indicators. That is, the cooccurrence of concept A in document
D and query Q is as good a measure of D's relevance to Q as the lack of this
cooccurrence is a measure of D's nonrelevance. As more structure is
imposed on the comparison of documents and queries, cooccurrences become
more significant but less frequent, while non-cooccurring structures become
less significant and more frequent. For example, if documents are retrieved
only if they match, word for word, the complete query, few if any documents
would be returned. However, any document which is retrieved by this scheme
would almost certainly be relevant. On the other hand, the fact that
some documents do not match the complete query is not a good indicator of
their nonrelevance. The situation is similar for phrases. Thus treating
phrase concepts simply as additional word concepts over-emphasizes their role
as nonrelevancy indicators and, while it may provide improved precision, it
has disastrous effects on recall.

The problems presented above make it necessary to treat phrase and
word concepts differently. In particular the role of phrases as a relevancy
indicator must be weighted much more heavily than their role as a nonrelevancy
indicator. The method designed to accomplish this is called cooccurrence
matching and considers phrases only when they cooccur between a document and
a query. Its operation may be seen from the following example. Let D and
Q be the word concept vectors for a particular document and query, and PD
and PQ their associated phrase concept vectors. If phrase concepts are
treated as word concepts, the correlation is calculated between D + PD
and Q + PQ. The cooccurrence method on the other hand first calculates
C = PQ ∩ PD. That is, C is the set of phrase concepts common to both the
query and document. Correlation is then calculated between D + C and Q + C.

In this way it is guaranteed that phrase concepts cannot lower the correlation 
and in the worst case where C is empty, the correlation is unaffected by 
the phrases. This process avoids the two previously discussed pitfalls 
associated with phrase use. First, by ignoring all unmatched phrase 
concepts, the vectors cannot become diluted with useless and possibly 
detrimental information. Secondly, phrases are used only as a relevancy
indicator while their far weaker role of nonrelevancy indicator is not
considered. The experiments performed in the remainder of this study all 
employ the cooccurrence principle for handling phrase concepts. The next 
two experiments are repeats of the previous two with the addition of the 
use of the cooccurrence phrase matching technique. The results are 
shown in Figures 4 and 5 and once again show no improvement over the no 
phrase method. A more complete analysis of these results is presented 
below. 
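The difference between treating phrases as ordinary concepts and cooccurrence matching can be sketched with sparse vectors. The example below is an illustration under stated assumptions, not the SMART code: vectors are dictionaries from concept number to weight, phrase concept numbers are assumed to be disjoint from word concept numbers, each shared phrase keeps its own weight in the document and in the query, and all function names and sample numbers are invented.

```python
import math


def cosine(u, v):
    """Cosine correlation of two sparse vectors (dict: concept -> weight)."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    len_u = math.sqrt(sum(w * w for w in u.values()))
    len_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (len_u * len_v) if len_u and len_v else 0.0


def appended_correlation(d, pd, q, pq):
    """Phrases treated as word concepts: correlate D + PD with Q + PQ."""
    return cosine({**d, **pd}, {**q, **pq})


def cooccurrence_correlation(d, pd, q, pq):
    """Cooccurrence matching: correlate D + C with Q + C, C = PQ ∩ PD."""
    common = pd.keys() & pq.keys()
    d_ext = {**d, **{c: pd[c] for c in common}}
    q_ext = {**q, **{c: pq[c] for c in common}}
    return cosine(d_ext, q_ext)


# Invented example: word concepts 1 and 2, phrase concepts 100 to 102.
d, q = {1: 1.0, 2: 1.0}, {1: 1.0}
pd = {100: 1.0, 101: 1.0, 102: 1.0}

no_phrases = cosine(d, q)
diluted = appended_correlation(d, pd, q, {})        # no phrase matches
unchanged = cooccurrence_correlation(d, pd, q, {})  # C is empty
matched = cooccurrence_correlation(d, pd, q, {100: 1.0})
```

In this small case the three unmatched document phrases lower the appended correlation, the cooccurrence value with empty C equals the no-phrase value, and a single shared phrase raises it.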



D) Elimination of the Phrase List 

All methods discussed so far for using phrases in retrieval have 
required a phrase list. As previously mentioned the creation of these 
lists, whether by hand or by statistical processes, raises certain inher- 
ent problems. In general, it is far more desirable to be able to determine 
phrases without the need of such a list. One possible solution is to
perform a syntactic analysis of the text, and determine all the phrases.
The set of phrases thus generated is then normalized to associate all
syntactically different but semantically identical phrases. This is
accomplished, for example, by transformational kernelization of the phrases
or by the use of a criterion tree matching scheme. Each phrase in the
reduced set is then assigned a concept number, and retrieval proceeds
as in the previous cases. However the syntactic analysis and normalization
processes are prohibitively complex and produce a very large number of
phrases. For these reasons an alternate method is used.

One of the easiest ways of accomplishing some degree of phrase 
processing without a phrase list is by means of the implicit phrase method.
The philosophy behind this technique is that the cooccurrence in the docu- 
ment and query of several different concepts should be considered a 
better relevancy indicator than the cooccurrence of a single concept which
has multiple occurrences and hence a higher weight. Consider the sample 
query and document vectors in Figure 6. The cosine correlation assigns 
the same correlation value to both. The second document however would seem 
to be more relevant to the query. The use of implicit phrases allows this 
fact to be reflected in the final correlation value. The basis of this 
process is a modified correlation coefficient formula:

$$C_{dq} = \sum_{i=1}^{N} d_i q_i + K(m-1)$$

where m is the number of different concepts which cooccur in the document
and query, and K is a constant. In the general case K is determined by P,
where P is an experimental parameter. In this way each pair of cooccurring
concepts in the document and query is treated as a phrase and the correlation
is treated accordingly. In Figure 6 for example, the implicit phrases
correlation between document 1 and the query remains unchanged while the
correlation of document 2 is raised to 0.774 thus reflecting its apparent
greater relevancy. Figure 7 shows the results of retrieval using the ADI
collection and the implicit phrase process with various values for P. It
indicates that some improvement is achieved over the no-phrase process.
However, one of the main drawbacks of the process is that it fails to
fulfill one of the primary objectives for phrase use. That is, it cannot
discriminate between documents with similar concepts but different structural
relationships among these concepts. For this reason a more syntactically
oriented approach to phrase processing must be used.

Figure 6. Sample Document and Query Vectors

QUERY: INFORMATION RETRIEVAL
DOC-1: INFORMATION ABOUT INFORMATION
DOC-2: INFORMATION RETRIEVAL AND SYSTEMS ANALYSIS
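The implicit phrase idea can be tried on the Figure 6 example. The normalization of the modified coefficient is not legible in the text, so the sketch below makes an assumption: the term K(m-1) is added to the inner product and also to each squared vector length, as if the implicit phrases were extra shared concepts. The unit weights, the treatment of "about" and "and" as common words, and the choice K = 1 are likewise assumptions. Under them, the sketch reproduces the value 0.774 quoted for document 2.

```python
import math


def implicit_phrase_correlation(d, q, k=1.0):
    """Cosine-style correlation with an implicit phrase bonus K(m - 1).

    d, q: dict mapping concept -> weight.
    m is the number of distinct concepts shared by d and q. The bonus
    is applied only when m > 1; adding it to both squared lengths is
    an assumed normalization, not one taken from the report.
    """
    shared = d.keys() & q.keys()
    m = len(shared)
    bonus = k * (m - 1) if m > 1 else 0.0
    dot = sum(d[c] * q[c] for c in shared)
    d_len2 = sum(w * w for w in d.values())
    q_len2 = sum(w * w for w in q.values())
    denom = math.sqrt((d_len2 + bonus) * (q_len2 + bonus))
    return (dot + bonus) / denom if denom else 0.0


query = {"information": 1, "retrieval": 1}
doc1 = {"information": 2}
doc2 = {"information": 1, "retrieval": 1, "systems": 1, "analysis": 1}

c1 = implicit_phrase_correlation(doc1, query)           # one shared concept
c2_plain = implicit_phrase_correlation(doc2, query, 0)  # K = 0: plain cosine
c2 = implicit_phrase_correlation(doc2, query)           # two shared concepts
```

With K = 0 both documents receive the same cosine value; with K = 1 document 1 is unchanged because it shares only one concept with the query, while document 2 is raised.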

The syntactic process used is relational content analysis. This 
process determines syntactic relations between pairs of text words. The 
details of relational content analysis are discussed by Weiss [9]. Concepts
which are determined to be related by the content analyzer are encoded into
a special phrase concept number, XXXXYYYYZZ, where XXXX represents the
concept number of the first word, YYYY the second, and ZZ is the relation
between them. The order of the two concepts is significant for all relations 
except parallel in which the smaller concept number appears first. The 
encoded relational phrases are treated as concept numbers and assembled into 
a phrase concept vector. The phrase vector must be kept separate from the 
word vector to permit the use of the cooccurrence phrase matching process. 

The retrieval results for this technique with the ADI test set appear in 
Figure 8. 
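The XXXXYYYYZZ encoding can be sketched as a small helper. The relation code used for "parallel" and the function name are hypothetical, since the text does not list the actual relation codes; only the field layout and the ordering rule for parallel relations come from the description above.

```python
PARALLEL = 1  # hypothetical two-digit code for the parallel relation


def encode_relational_phrase(first, second, relation, parallel_code=PARALLEL):
    """Encode two concept numbers and a relation as a 10-digit string.

    Layout: XXXX (first concept), YYYY (second concept), ZZ (relation).
    Order is significant except for the parallel relation, where the
    smaller concept number is placed first.
    """
    if not (0 <= first <= 9999 and 0 <= second <= 9999):
        raise ValueError("concept numbers must fit in four digits")
    if not 0 <= relation <= 99:
        raise ValueError("relation code must fit in two digits")
    if relation == parallel_code and second < first:
        first, second = second, first
    return f"{first:04d}{second:04d}{relation:02d}"


ordered = encode_relational_phrase(123, 45, 7)
parallel_a = encode_relational_phrase(123, 45, PARALLEL)
parallel_b = encode_relational_phrase(45, 123, PARALLEL)
```

Because a parallel pair always encodes to the same string regardless of word order, the two occurrences match under cooccurrence matching, while ordered relations only match when the concepts appear in the same roles.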

[Figure 8. Experiment 7: Relational Content Analysis]

Using this type of process for phrase determination has a number
of advantages. First, it alleviates the need for an a priori phrase list.
Also, being a relatively simple process, it has significantly more practical
value than some of the more complex systems. Clearly a great deal of
syntactic information is lost since only word pairs are considered. However,
cooccurrences in documents and queries of syntactic structures more complex 
than word pairs are exceedingly rare. Thus despite its simplicity,
relational content analysis does perform the particular aspect of syntactic
analysis most relevant to information retrieval. Besides the advantages there
are also some disadvantages inherent in this type of system. Most serious
is its inability to associate semantically similar phrases. A system that
uses a phrase list can recognize equivalent phrases whose constituent 
concepts are not equivalent. For example, the phrases "memory holding" 
and "data processing" are both assigned the same phrase concept by the 
SMART phrase list for the ADI collection, while each of the four words 
falls into a different concept class. The recognition of such equivalent 
phrases is impossible for systems which do not employ such a list or
extensive semantic normalization. It may therefore be expected that
retrieval results achieved by the relational concept analyzer will be 
inferior to those achieved in previous experiments. However, retrieval 
without the requirement of a phrase list seems to be a more reasonable 
approach to the problem. This is especially true in the case of large 
document collections where manual creation of a phrase list is impossible 
and statistical creation is unreliable.

E) Analysis of ADI Results 

The results of the seven retrieval experiments are summarized in
Figure 9. The plus or minus to the right of each figure indicates whether
it is above (+) or below (-) the standard no-phrase value achieved for
that recall level (experiment 1). The results clearly show that there
is no great gain achieved by the use of phrases and in some cases their
use appears to be actually detrimental. However, upon more careful
analysis of these results, a number of unusual factors are found which
make these results somewhat less discouraging than they initially
appear.

Consider first the results obtained with the statistical and 
syntactic phrases. It is argued in section C that the use of cooccur- 
rence improves the retrieval quality. The results seem to indicate 
that exactly the opposite is true for experiment 4 and that experiment 
5 results exceed experiment 3 at only half of the recall points. 

Upon analysis of the retrieval output it is discovered that the reason 
for this apparent turnabout is the dilution of nonrelevant concept 
vectors due to unmatched concepts. For many of the queries analyzed, 
there are one or more documents which are highly correlated with the 
query but nonrelevant, and which have a relatively large number of 
phrases that are not matched in the query. Because of the dilution effect 
which occurs when cooccurrence is not used, the correlations for these 
documents are lowered, often to a level below that of one of the relevant 
documents. The rank of the relevant document is thus raised by default 
even though its own correlation is not altered. Consider for example 
the correlation of document 11 with query A4. With no phrases used, 
this nonrelevant document ranks sixth with a correlation of 0.248189. 
The document has 13 statistical phrases which do not match the 
query. When retrieval is performed using these phrases without 
cooccurrence, the coefficient is reduced to 0.13599 and the rank lowered 
to ninth place. This allows one of the relevant documents to move ahead, 
producing an apparent improvement in retrieval quality. When cooccurrence 
is used there are no phrase matches, the coefficient remains 0.248189, and 
the relevant document is not allowed to move up. 

Considering the entire set of 33 documents relevant to the test queries, 
the ranks of 16 are improved by the use of statistical phrases with no 
cooccurrence. However, of these, only 7 actually move up in correlation 
coefficient. The remaining 9 lose in correlation but gain in rank due to 
the dilution and consequent lowering of nonrelevant documents. Ten of the 
33 relevant documents lose in both rank and coefficient, mostly due to 
being diluted themselves, while 7 remain fixed in rank. Of these 7, 5 are 
reduced in coefficient but by an amount insufficient to drop the rank. 
Thus the apparent superiority of the no-cooccurrence process (experiments 
2 and 3) over the normal method (experiment 1) and the cooccurrence 
process (experiments 4 and 5) is almost entirely due to the lowering of 
the correlation coefficient of certain nonrelevant documents. This in turn 
is aided by the fact that most documents with a large number of phrases 
are not relevant to any query. The reduction in rank of these documents 
with respect to any query is thus guaranteed to cause, at worst, no harm 
and possibly to produce a default raise in rank of a relevant document. 
This situation is clearly not typical. In general, every document must be 
considered as a potential relevant document. Lowering the rank for some 
set of documents for all queries would thus help retrieval in some cases 
and harm it in others. The results of experiments 2 and 3 reflect some 
positive effect caused by increasing the correlation in relevant 
documents. However, this effect is quite small. In general it can be 
concluded that since the conditions which led to the results of 
experiments 2 and 3 cannot be considered typical, the improved retrieval 
quality achieved with no-cooccurrence must therefore be held suspect. 
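
The dilution effect described above can be illustrated with a small sketch. It assumes a cosine correlation between sparse concept vectors and uses invented concept names, weights, and documents; none of these values come from the ADI experiments. Adding 13 phrase concepts that the query does not contain lengthens the nonrelevant document vector, lowers its correlation, and lets an unchanged relevant document move up by default.

```python
import math


def cosine(u, v):
    """Cosine correlation of two sparse vectors given as {concept: weight} dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def add_unmatched_phrases(doc, count, weight=1.0):
    """Return a copy of doc extended with `count` phrase concepts absent from the query."""
    diluted = dict(doc)
    for k in range(count):
        diluted[f"phrase_{k}"] = weight
    return diluted


def rank_of(target, correlations):
    """1-based rank of `target` when documents are sorted by decreasing correlation."""
    ordered = sorted(correlations, key=correlations.get, reverse=True)
    return ordered.index(target) + 1


# Hypothetical vectors: each document shares exactly one concept with the query.
query = {"c1": 1.0, "c2": 1.0}
nonrelevant = {"c1": 1.0, "c3": 1.0, "c4": 1.0}
relevant = {"c2": 1.0, "c5": 1.0, "c6": 1.0, "c7": 1.0, "c8": 1.0, "c9": 1.0}

before = {
    "nonrelevant": cosine(query, nonrelevant),
    "relevant": cosine(query, relevant),
}
after = {
    # 13 unmatched phrases, as for document 11 in the text; the weights are assumed.
    "nonrelevant": cosine(query, add_unmatched_phrases(nonrelevant, 13)),
    "relevant": before["relevant"],  # the relevant document itself is untouched
}

print(rank_of("relevant", before), rank_of("relevant", after))
```

The relevant document gains a rank although its own coefficient is identical in both runs, which is the default improvement the text warns about.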



Attention is next focused on experiments 4 and 5 which use 
statistical and syntactic phrases with the cooccurrence technique. When 
compared with experiment 1, the results seem to indicate that the 
cooccurrence processes are harmful to retrieval quality. However, this 
result is misleading as a result of a peculiar situation. This can be 
understood by considering the results of experiment 4. Of the 33 relevant 
documents, this phrase process improves both the rank and correlation for 
9; 5 are reduced in rank; while the remaining 19 are unchanged. Overall 
this seems to be an improvement, but the tabulated results in Figure 9 
do not bear this out. The reason for this lack of improvement lies 
almost entirely with query B5. It has only one relevant document and 
the phrase process lowers its rank from second to fifth, thus lowering 
its precision for all recall levels from 0.5 to 0.2. This is a 
considerable decrease in precision, and since the values are averaged 
over only ten queries, the effect on the average is substantial. If 
precision values are taken for the nine other queries only, the values 
for the phrase processes exceed those for the no-phrase experiment for 
nearly all recall levels. Thus except for a rather unusual query, these 
phrase processes using cooccurrence provide some degree of improved 
retrieval results. The main drawback of such a process is the need for an 
a priori phrase list. It is for this reason that the major emphasis in 
this study is on phrase methods which do not require predetermined lists. 

The tabulations in Figure 9 indicate that results achieved by using 
the no-phrase-list method based on relational content analysis (experiment 
7) are inferior to both the phrase list and no-phrase results. This is in 
part due to the method's inability to associate phrases with different 
constituent concepts. The inferior results can also be blamed on the 
very small number of cooccurrences. Of the more than 800 relations 
entered, only 28 cooccurrences between documents and queries are found. 
This very low number can be blamed, at least in part, on the queries. 
They are all quite short and contain very few phrases. The queries also 
tend to be quite general. Since retrieval is performed by concept matching 
and not by hierarchical expansion, general queries do not always produce 
the desired results. Of the 28 cooccurrences, only 5 occur between a 
query and one of its relevant documents. Of the ten test queries, three 
have no cooccurrences at all, and their results are clearly not altered 
from the no-phrase case. Four queries have cooccurrences in nonrelevant 
documents only and these results are obviously lowered. The three remaining 
queries have cooccurrences in relevant documents; however, an improvement 
is realized in only one. Of the other two, one shows an improvement in 
correlation coefficient, but insufficient for a rank change, and the other 
has cooccurrences in nonrelevant documents which overshadow any improvement. 
These results might appear to cast some doubt on the value of this method. 
However, this evidence is inconclusive and thus any decision is premature. 
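
The influence of query B5 on the averaged results, discussed above, can be checked with a short calculation. The sketch assumes that for a query with a single relevant document the precision at every recall level is the reciprocal of that document's rank, which agrees with the values 0.5 and 0.2 quoted in the text; the function name is illustrative only.

```python
def precision_single_relevant(rank):
    """Precision at every recall level when the only relevant document is at `rank`."""
    return 1.0 / rank


num_queries = 10                                    # queries averaged in Figure 9
precision_before = precision_single_relevant(2)     # rank 2 without phrases
precision_after = precision_single_relevant(5)      # rank 5 with phrases

# Change in the ten-query average caused by this one query alone.
shift_in_average = (precision_after - precision_before) / num_queries

print(precision_before, precision_after, round(shift_in_average, 3))
```

A single query therefore moves every point of the averaged curve down by about 0.03, which is large relative to the small differences between the experiments.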

From the previous experiments it appears that the various phrase 
and structure methods can provide some degree of improvement in retrieval 
quality. But this improvement may be insufficient to warrant the additional 
work needed to use them. This deficiency, however, cannot be blamed entirely 
on weaknesses in the methods used. In the introduction to this study one 
of the primary uses of phrases in information retrieval is stated to be the 
separation of highly correlated, but not semantically identical, documents. 
A document collection must therefore contain such close documents in order 
for phrases to demonstrate any significant retrieval improvement. To 
determine if the ADI collection provides a fair testbed for phrase use, a 
document-document correlation is performed. The results indicate an average 
document-document correlation of 0.1 and a maximum of 0.8. This indicates 
that the ADI document space is in general quite sparse, but it may still 
contain some dense clumps of documents. To test for this, a third statistic 
is calculated: the average maximum document-document correlation (AMC). 
This is the correlation between a given document and its nearest neighbor 
averaged over all document-document pairs. In the ADI collection the AMC 
is less than 0.4, thus indicating the general absence of dense document 
clumps. Thus the documents in the ADI collection are seen to be quite spread 
out in the document space, and the extra dimension of refinement added 
to the documents and queries by the use of syntax is superfluous. 
Therefore, to test more conclusively the usefulness of phrases in 
information retrieval, a more dense collection must be tried. Experiments 
with various other collections comprise the remainder of this study. 
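
The three density statistics used above can be computed as in the following sketch. It assumes a cosine correlation between sparse concept vectors and reads the AMC as the nearest-neighbor correlation of each document averaged over the documents; the function names and the toy collection are illustrative and are not taken from the SMART programs.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine correlation of two sparse vectors given as {concept: weight} dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def density_statistics(docs):
    """Return (average pairwise correlation, maximum pairwise correlation, AMC)."""
    n = len(docs)
    if n < 2:
        raise ValueError("at least two documents are required")
    nearest = [0.0] * n          # best correlation seen so far for each document
    total = 0.0
    pairs = 0
    maximum = 0.0
    for i, j in combinations(range(n), 2):
        c = cosine(docs[i], docs[j])
        total += c
        pairs += 1
        maximum = max(maximum, c)
        nearest[i] = max(nearest[i], c)
        nearest[j] = max(nearest[j], c)
    return total / pairs, maximum, sum(nearest) / n


# Toy collection: two identical documents and one unrelated document.
toy = [{1: 1.0}, {1: 1.0}, {2: 1.0}]
average, maximum, amc = density_statistics(toy)
print(average, maximum, amc)
```

A low average with a high AMC would point to dense clumps in a sparse space; the ADI values quoted above (average 0.1, AMC below 0.4) show neither overall density nor clumps.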

3. The Cranfield Collection 

The Cranfield-424 Collection is a set of 424 documents in the field 
of aerodynamics. Because of its single specialized theme it might conceivably 
provide a denser collection on which to perform phrase experiments. 
Unfortunately this is not the case. Results of a document-document 
correlation are effectively the same as those for the ADI. The average 
document-document correlation is less than 0.1 and the AMC is about 0.4. 
It may therefore be expected that the Cranfield and ADI share the same 
undesirable characteristics concerning phrase use. For this reason the 
Cranfield collection is not used in this study. 

4. The TIME Subset Collection 

A) Construction 

Because the existing collections do not exhibit the desired 
characteristics for conclusive testing of phrase techniques, a new collection 
is constructed. The process for creating such a collection is as follows. 
From an existing set of documents and queries, a subset of closely related 
queries is chosen. The set of documents relevant to any query in the subset 
is taken as the new document collection. The fact that these documents are 
all relevant to closely related queries guarantees that the documents 
themselves are also highly correlated. The collection chosen for this study 
is a set of articles from the "World" section of "TIME Magazine" (1963) 
with an associated set of current events queries. The largest group of 
related queries consists of six which deal with the Viet Nam war and 
particularly with the religious and political strife leading up to the 
overthrow of the Diem government. A total of 27 documents are relevant to 
these queries, and these form the TIME subset collection. The relatively 
small size of this document set detracts somewhat from the significance of 
the results of experiments using it, but not as much as might be expected. 
This is true for several reasons. First, the subset can be thought of as a 
single cluster in a large clustered document set. Since the subset contains 
all of the Viet Nam articles, its cluster centroid would clearly correlate 
highly with any Viet Nam related query. The real retrieval problem then 
becomes picking the desired articles from within the cluster. And second, 
the purpose of this set is to test the usefulness of phrases in information 
retrieval, and phrases are micro rather than macro information retrieval 
aids. That is, the primary use for phrases is in determining fine 
differences in closely related documents, and not in producing tremendous 
rank increases for low ranking documents. Thus this type of collection is 
sufficient for testing phrase processes. 
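
The construction procedure described above amounts to taking the union of the relevance sets of the chosen queries. A minimal sketch follows; the query and document identifiers are invented for illustration, and the choice of which queries count as closely related is assumed to be made by hand, as in the text.

```python
def build_subset_collection(relevance, related_queries):
    """Return the sorted ids of all documents relevant to any of the chosen queries.

    relevance: {query id: set of relevant document ids}
    related_queries: the hand-picked group of closely related query ids
    """
    subset = set()
    for query_id in related_queries:
        subset |= relevance[query_id]
    return sorted(subset)


# Hypothetical relevance judgments; Q1 to Q3 stand for the related queries.
relevance = {
    "Q1": {"D03", "D07", "D11"},
    "Q2": {"D07", "D12"},
    "Q3": {"D11", "D19"},
    "Q9": {"D40", "D41"},      # unrelated query, left out of the subset
}
subset = build_subset_collection(relevance, ["Q1", "Q2", "Q3"])
print(subset)
```

Documents judged relevant to more than one query appear once, so the subset is usually smaller than the sum of the individual relevance sets.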

The TIME articles are written in a very conversational and chatty 
style as opposed to the technical style of the ADI and Cranfield collections. 
For example, a document dealing with the Vietnamese coup begins: 

    Coping with Capricorn in business, count the costs before you 
    act. The moon now in Capricorn suggests keeping practical 
    values in mind. Tomorrow is rather too energetic for comfort, 
    but that may be because everybody is on the move. (A late 
    August horoscope.) Syndicated horoscopes, many of them from 
    abroad are a popular feature in many South Vietnamese 
    newspapers, but last week the government banned them, presumably 
    on a theory that some star-minded dissident might be moved 
    to try a coup on an astrologically auspicious day. 

    ["TIME", 9/6/63, page 19] 

The article then presents its true purpose, that of describing the 
increasing United States dissatisfaction with the present South Vietnamese 
government and the possibility of an American-encouraged coup. The article 
goes on to describe the martial law measures being taken by the Vietnamese 
government to prevent a coup, and then gives a brief biography of several 
generals who might stage the coup. Thus the crux of the article is to 
describe the tenuous political situation in Viet Nam, not to discuss astrology. 
The paragraph quoted above thus serves merely as a light introduction. 

Construction of document vectors from the full text of articles such 
as this could very well result in a tremendous amount of spurious information 
in the vector. For this reason, and because of the document length, it is 
necessary to form abstracts. The abstracts used are about one hundred words 
in length and present the main ideas of the article using much the same 
vocabulary and constructions as in the original text. The abstracts thus 
capture the gist of the article in both content and style while eliminating 
most of the unrelated chaff. Using these abstracts, a vocabulary is 
constructed and document vectors are formed using standard SMART dictionary 
construction and vector creation programs. The dictionary assigns a single 
concept number to all words with a common stem. Figure 10 presents the results 
of a normal SMART search with the TIME subset collection. The results are 
consistent with retrieval results using other collections. There thus 
seems to be nothing particularly unusual about this document and query set 
which might tend to diminish the significance of any experimental results. 
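
The idea of a dictionary that gives one concept number to all words with a common stem can be sketched as follows. This is only an illustration: the suffix-stripping rule is a crude stand-in and is not the SMART stemming procedure, and the function names, concept numbers, and sample sentence are all hypothetical.

```python
from collections import Counter

SUFFIXES = ("ing", "ed", "s")   # assumed toy suffix list, not the SMART suffix list


def crude_stem(word):
    """Lowercase the word and strip one suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_dictionary(vocabulary):
    """Map each distinct stem to a concept number, in order of first appearance."""
    concepts = {}
    for word in vocabulary:
        stem = crude_stem(word)
        if stem not in concepts:
            concepts[stem] = len(concepts) + 1
    return concepts


def document_vector(text, concepts):
    """Return {concept number: occurrence count} for the words found in the dictionary."""
    counts = Counter()
    for token in text.split():
        stem = crude_stem(token.strip(".,;:\"'()"))
        if stem in concepts:
            counts[concepts[stem]] += 1
    return dict(counts)


concepts = build_dictionary(["coup", "coups", "general", "generals"])
vector = document_vector("Generals planned a coup; coups fail.", concepts)
print(concepts, vector)
```

Because "coup" and "coups" share a stem they contribute to the same vector element, which is the behavior the text attributes to the dictionary.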

Three sets of phrase experiments are performed using the TIME subset 
collection. The first two are the implicit and relational methods as 
presented earlier. As before, various parameters are used to weight the 
importance of a phrase match in the correlation calculation. A third phrase 
process called half relational is also used. This is a weaker form of 
relational phrase matching (hereafter referred to as full relational for 
clarity). In full relational, a phrase match occurs only when the document 
and query have the same concept pair and the concepts are joined by the same 
relation. In Figure 11 below, the query phrase QP matches only document 
phrase DP1. In half relational matching, a match occurs when the document 
and query share a concept which occurs in the same relational context in 
both vectors. For example in Figure 11, the query QP matches document 
phrases DP1, 2, and 3 but not 4. While the query concept matches in DP4, 
the relational context does not. That is, in QP concept 5 is a modifier 
while in DP4 it is modified. Thus as the name implies, half relational 
matches require only one of two related concepts to match. This is clearly 
a weaker matching criterion. It could be of value in cases where 
cooccurrences of whole phrases are rare, but it may also give many improper 
matches. 



        QP    < 5, 7, MOD> 

        DP1   < 5, 7, MOD> 
        DP2   < 5, 9, MOD> 
        DP3   <13, 7, MOD> 
        DP4   < 3, 5, MOD> 

Sample Query and Document Relation Phrases 
Figure 11 
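
The two matching rules can be stated compactly in code using the phrases of Figure 11. The sketch assumes that each triple is ordered (modifier, modified, relation), as suggested by the remark that concept 5 is a modifier in QP and is modified in DP4, and that a half relational match also requires the relation type to agree; the type and function names are illustrative.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    modifier: int    # concept number in the first position
    modified: int    # concept number in the second position
    relation: str    # relation type, e.g. "MOD"


def full_match(query, doc):
    """Full relational: same concept pair, same roles, same relation."""
    return query == doc


def half_match(query, doc):
    """Half relational: one shared concept in the same role under the same relation."""
    if query.relation != doc.relation:
        return False
    return query.modifier == doc.modifier or query.modified == doc.modified


QP = Relation(5, 7, "MOD")
DOCUMENT_PHRASES = {
    "DP1": Relation(5, 7, "MOD"),
    "DP2": Relation(5, 9, "MOD"),
    "DP3": Relation(13, 7, "MOD"),
    "DP4": Relation(3, 5, "MOD"),
}

full = [name for name, dp in DOCUMENT_PHRASES.items() if full_match(QP, dp)]
half = [name for name, dp in DOCUMENT_PHRASES.items() if half_match(QP, dp)]
print(full, half)
```

DP4 is rejected by both rules because concept 5 occupies the opposite role there, while DP2 and DP3 are accepted only by the weaker rule.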



The results for these experiments are shown in Figure 12 A, B, and 
C. Figure 13 gives the tabulated results for each method using the weighting 
parameter which provides the best results. While these represent the best 
values, the results achieved for other parameter values are only very slightly 
lower. As before the figure shows whether the results of the phrase experi- 
ment are above (+) or below (-) those achieved when no phrases are used. 

These results reveal that implicit phrase matching is harmful to retrieval 
quality and gets worse as the weighting parameter is increased. Half relational 
shows some slight improvement for low recall values while full relational is 
generally worse. However in these latter two methods, all differences are 
very small and effectively insignificant. 




B) Analysis of Results 

The most surprising result of this set of experiments is the harmful 
effect caused by implicit phrases. This is inconsistent with the results 
obtained with the ADI collection. This apparent turnabout can be explained 
by recalling the original purpose for using implicit phrases. This is to 
separate those documents whose correlation is based on a cooccurrence of 
several concepts in the document and query from those documents whose 
correlation results from one or two highly weighted concepts. In the ADI 
collection, there are many concepts in the documents with weights of 
twenty-four or more so that there is a real need for such a separation 
technique. As a result, implicit phrases provide improved retrieval for the 
ADI. In the TIME collection occurrences of highly weighted concepts are much 
rarer than in the ADI. Consequently the reason for using implicit phrases 
does not exist. Employing the phrase technique thus does not accomplish the 
purpose for which it is designed and hence no improvement is realized. Thus 
it appears that implicit phrases may be a useful technique but only when 
used with collections which meet certain requirements as to the presence 
of highly weighted concepts. 

                          TIME IMPLICIT PHRASES 
RECALL    STANDARD    P = 0.5    P = 1.0    P = 1.5    P = 2.0 

  0.1      .6426      .6333-     .5639-     .5635-     .5635- 
  0.2      .6426      .6333-     .5639-     .5635-     .5635- 
  0.3      .5537      .5778+     .5639+     .5635+     .5635+ 
  0.4      .5500      .5361-     .5125-     .5135-     .5135- 
  0.5      .5500      .5361-     .5125-     .5135-     .5135- 
  0.6      .4781      .4604-     .4447-     .4429-     .4429- 
  0.7      .4217      .4215-     .4256+     .4183-     .4183- 
  0.8      .3745      .3652-     .3564-     .3579-     .3579- 
  0.9      .3702      .3577-     .3555-     .3496-     .3496- 
  1.0      .3669      .3577-     .3555-     .3496-     .3496- 

Summary of TIME Implicit Phrase Experiments 
Figure 12A 

                       TIME FULL RELATIONAL PHRASES 
RECALL    STANDARD    P = 0.5    P = 1.0    P = 1.5    P = 2.0 

  0.1      .6426      .6389-     .6359-     .6333-     .6333- 
  0.2      .6426      .6389-     .6359-     .6333-     .6333- 
  0.3      .5537      .5500-     .5803+     .5778+     .5778+ 
  0.4      .5500      .5417-     .5215-     .5190-     .5190- 
  0.5      .5500      .5417-     .5215-     .5190-     .5190- 
  0.6      .4781      .4614-     .4520-     .4578-     .4634- 
  0.7      .4217      .4079-     .4041-     .4099-     .4154- 
  0.8      .3745      .3632-     .3602-     .3577-     .3577- 
  0.9      .3702      .3632-     .3602-     .3577-     .3577- 
  1.0      .3669      .3632-     .3602-     .3577-     .3577- 

Summary of TIME Full Relational Phrase Experiments 
Figure 12B 

                       TIME HALF RELATIONAL PHRASES 
RECALL    STANDARD    P = 0.5    P = 1.0    P = 1.5    P = 2.0 

  0.1      .6426      .6274-     .6274-     .6274-     .6663+ 
  0.2      .6426      .6274-     .6274-     .5857-     .6107- 
  0.3      .5537      .5218-     .5163-     .5718+     .6107+ 
  0.4      .5500      .5218-     .5112-     .5649+     .5788+ 
  0.5      .5500      .5218-     .5112-     .5649+     .5788+ 
  0.6      .4781      .4448-     .4362-     .4468-     .4468- 
  0.7      .4217      .4111-     .4062-     .4111-     .4111- 
  0.8      .3745      .3395-     .3350-     .3259-     .3198- 
  0.9      .3702      .3395-     .3350-     .3259-     .3198- 
  1.0      .3669      .3372-     .3327-     .3236-     .3175- 

Summary of TIME Half Relational Phrase Experiments 
Figure 12C 

RECALL    STANDARD    IMPLICIT    FULL       HALF 
                      P = 0.5     P = 0.5    P = 2.0 

  0.1      .6426      .6333-      .6389-     .6663+ 
  0.2      .6426      .6333-      .6389-     .6107- 
  0.3      .5537      .5778+      .5500-     .6107+ 
  0.4      .5500      .5361-      .5417-     .5788+ 
  0.5      .5500      .5361-      .5417-     .5788+ 
  0.6      .4781      .4604-      .4614-     .4468- 
  0.7      .4217      .4215-      .4079-     .4111- 
  0.8      .3745      .3652-      .3632-     .3198- 
  0.9      .3702      .3577-      .3632-     .3198- 
  1.0      .3669      .3577-      .3632-     .3175- 

Summary of TIME Processes 
Best Results Used for Each 
Figure 13 



The results achieved using both half and full relational content 
analysis are discouraging. They may be the result of weakness in the phrase 
process or, as in the case of the ADI collection, they may be caused by the 
collection itself. Figure 14 shows for each method how many phrases are 
matched with relevant and nonrelevant documents. In both cases only about 
one-third of the phrase matches are between a query and one of the relevant 
documents. This seems to indicate that the weakness may lie in the phrase 
matching method; however, this is only partially true. The reason for the 
poor results for the half relational method is simply that the matching 
criteria are too weak. Too many false and incorrect phrases are matched, 
and lower retrieval quality results. It therefore seems the half relational 
method is worthless, although some further testing is necessary to finalize 
the decision. The reason for the poor results with the full relational 
method is not so clearly the fault of the matching scheme. Of the 82 phrase 
matches between documents and queries, 65, or roughly 80%, are matches 
of the phrases "South Viet" or "Viet Nam". Since the entire collection 
deals with South Viet Nam, these phrases occur almost uniformly throughout 
the document set. And since each query has an average of three times as 
many nonrelevant as relevant documents, the results in Figure 14 are to 
be expected. If this document collection were considered as one cluster 
of a larger collection, the phrase South Viet Nam would be useful in 
gaining access to the cluster. However, within the cluster it is a poor 
discriminator and thus cannot help retrieval. If South Viet Nam is removed 
from the set of phrase matches, more than two-thirds of the remaining phrase 
matches occur between a query and a relevant document and retrieval would 
clearly be improved. However, the small number of relations that remain 
seems to indicate the same collection sparseness as is found in the ADI and 
Cranfield collections. 





                        Number of Phrase Matches 

                 With Rel         With Nonrel 
                 Documents        Documents         Total 

Half 
Relational        89   32%        186   67%          277 

Full 
Relational        28   34%         54   65%           82 

Phrase Matches (TIME) 
Figure 14 
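
The effect of a phrase that occurs throughout the collection can be illustrated by tallying phrase matches with and without it. The match list and relevance judgments below are invented for illustration and do not reproduce the counts of Figure 14; the function name is likewise hypothetical.

```python
def relevant_share(matches, relevant, exclude=frozenset()):
    """Fraction of phrase matches that fall between a query and a relevant document.

    matches: list of (query id, document id, phrase) triples
    relevant: {query id: set of relevant document ids}
    exclude: phrases to leave out of the tally
    """
    kept = [(q, d) for q, d, phrase in matches if phrase not in exclude]
    if not kept:
        return 0.0
    hits = sum(1 for q, d in kept if d in relevant.get(q, set()))
    return hits / len(kept)


# Hypothetical data: one phrase matches almost everywhere, two are specific.
relevant = {"Q1": {"D1"}, "Q2": {"D3"}}
matches = [
    ("Q1", "D1", "south viet nam"),
    ("Q1", "D2", "south viet nam"),
    ("Q1", "D3", "south viet nam"),
    ("Q1", "D4", "south viet nam"),
    ("Q2", "D1", "south viet nam"),
    ("Q2", "D2", "south viet nam"),
    ("Q2", "D3", "buddhist protest"),
    ("Q2", "D3", "military coup"),
    ("Q1", "D2", "military coup"),
]

share_all = relevant_share(matches, relevant)
share_filtered = relevant_share(matches, relevant, exclude={"south viet nam"})
print(round(share_all, 2), round(share_filtered, 2))
```

In this toy case the share of matches with relevant documents rises from one third to two thirds once the ubiquitous phrase is set aside, the same direction of change the text reports for the TIME subset.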



A document-document correlation on the TIME subset collection reveals 
that the average correlation is 0.2. This is twice as high as the ADI or 
Cranfield and is to be expected since the TIME collection is designed 
specifically for high density. However, the average maximum correlation 
(AMC), which is a more important measure, is 0.41, roughly the same as for 
previous collections. This indicates that the increased density in the 
collection is achieved by the omission of low correlating documents, and 
not by the occurrence of highly correlated document pairs. This 
collection is thus seen as no better for phrase experimentation than the ADI. 
Thus it appears that even though this collection is constructed specifically 
for phrase use, it does not satisfy some of the theoretical prerequisites. 
The natural question at this point is exactly in what type of collection 
phrases are useful. This question is treated in the next section. 

Besides collection density, there is another factor affecting the 
usefulness of phrases. This is the type of relations occurring between 
text elements. There are basically two types of semantic relations by 
which phrase words may be associated: reversible and nonreversible. A 
reversible relation is one in which the ordering of the constituent words 
has no effect on the meaning. For example the words "information" and 
"retrieval", occurring in almost any structure, mean "information 
retrieval", and hence the words are related by a reversible relation. A 
nonreversible relation is one in which the phrase structure is significant. 
The relation between "U. S." and "Russia" in the sentence below is an 
example of a nonreversible relation. 

    The U. S. influences Russia. 

There is also a third type of relation, which is usually a specialized 
subset of nonreversible, called trivial nonreversible. These are phrases 
whose meaning depends on the structure and are technically nonreversible. 
However, with these special phrases, all but one of the potential meanings 
do not occur in practice, and the relation assumes reversible 
characteristics. For example, consider the sentence: 

    The U. S. invades Cambodia. 

Since it is possible for the U. S. to invade Cambodia and vice versa, the 
relation between U. S. and Cambodia is clearly nonreversible. However, 
since in fact Cambodia has not invaded and probably never will invade the 
United States, the relation is actually trivial nonreversible and hence its 
structure becomes unimportant. As mentioned earlier, one of the primary 
objectives of the use of structured phrases is in matching phrases whose 
meaning is a function of both content and structure, that is, 
phrases with nonreversible relations. If such phrases do not occur in 
the analyzed text, structured phrase use can clearly provide little or no 
help in retrieval. This is the case in the TIME collection. Of the 
phrases isolated, a vast majority are reversible or trivial nonreversible. 
Thus the lack of nonreversible relations is another reason for the failure 
of the content analysis scheme to achieve improved results. 
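
The practical difference between the two relation types can be shown by comparing an order-insensitive match with an order-sensitive one. The sketch represents a phrase as a pair of terms in sentence order; this representation and the function names are assumptions made for illustration, not the structures used by the content analyzer.

```python
def reversible_match(phrase_a, phrase_b):
    """Match on content only: the order of the two terms is ignored."""
    return frozenset(phrase_a) == frozenset(phrase_b)


def nonreversible_match(phrase_a, phrase_b):
    """Match on content and structure: the terms must appear in the same order."""
    return tuple(phrase_a) == tuple(phrase_b)


# "The U. S. influences Russia." against a query about the opposite direction.
document_phrase = ("U. S.", "Russia")
query_phrase = ("Russia", "U. S.")

# "information retrieval" means the same thing in either order.
ir_forward = ("information", "retrieval")
ir_backward = ("retrieval", "information")

print(reversible_match(document_phrase, query_phrase))
print(nonreversible_match(document_phrase, query_phrase))
print(reversible_match(ir_forward, ir_backward))
```

Only for the nonreversible pair does the structured match give a different answer from the content-only match, which is why structure can help only when such relations are present in both documents and queries.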



5. A Third Collection 

In the previous sections it is shown that the ADI and TIME 
collections do not require the use of phrases because they do not 
demonstrate the characteristics which provide the theoretical basis of 
phrase use. They are neither dense enough nor do they contain large numbers 
of nonreversible relations. Hence no significant advantage is gained 
through the use of phrases. Analysis of other natural collections such 
[...] this point is this: what is a collection like which has the desired 
characteristics? To attempt to answer this a purely artificial collection 
is constructed. The collection consists of twenty documents and fourteen 
queries, each in the form of a short sentence. The subject matter deals 
with the relation between birds and worms and is inspired by an example 
by Simmons [8]. This highly specific subject guarantees a highly dense 
document space. In addition, the documents are specifically written to 
include nonreversible relations. For example, in 

    Birds eat worms. 

    Worms eat grass. 

the words "worms" and "grass" are clearly nonreversibly related. This 
collection might thus be considered an ideal testbed for phrase 
experimentation. 

Results are tabulated in Figure 15 and shown graphically in Figure 
16. Because of the extreme closeness of the various results, only the 
best of each set is shown. Also the results of implicit phrases are not 
shown on the graph in Figure 16 since they coincide with the no phrase 
results. The lack of improvement here is caused, as in the TIME collection, 
by the lack of highly weighted concepts in the document and query vectors. 
Thus the problem which implicit phrases are designed to solve simply does 
not exist. The results for half relational phrases show a slight improvement 
at all recall levels. More important, however, are the results in Figure 
17. This indicates that only about a third of the half relational phrase 
matches are between a query and one of its related documents. This seems 
to confirm the conjecture stated earlier that half relational matching 
is too weak a criterion and results in too many improper phrase matches. 
It thus appears to be an unsuitable phrase process. As Figure 17 indicates, 
quite the opposite is true for full relational phrases. More than 
two-thirds of the full relational phrase matches are with relevant 
documents. This fact is also reflected in the improved precision at all 
recall levels achieved by full relational matching. These results can be 
treated both optimistically and pessimistically. On the one hand, they show 
conclusively that structural phrases can be of value in information retrieval. 
On the other hand, this improvement in retrieval results is not achieved 
in "natural" collections such as the ADI, but rather only for one which is 
highly artificial and contrived. It is not clear at this point whether any 
natural collection can meet all of the requirements for advantageous phrase 
use. 

RECALL    STANDARD    IMPLICIT    FULL       HALF 

  0.1      .8440       .8440      .9286+     .8810+ 
  0.2      .8440       .8440      .9286+     .8810+ 
  0.3      .8440       .8440      .9286+     .8810+ 
  0.4      .8440       .8440      .9286+     .8810+ 
  0.5      .8440       .8440      .9286+     .8810+ 
  0.6      .8083       .8383      .9000+     .8524+ 
  0.7      .7798       .7798      .9000+     .8524+ 
  0.8      .7798       .7798      .8929+     .8333+ 
  0.9      .7548       .7548      .8393+     .7554+ 
  1.0      .7548       .7548      .8393+     .7554+ 

Summary of B & W Phrase Processes 
Figure 15 

[Graph: precision (0 to 1.0) plotted against recall (0 to 1.0) for the 
standard, implicit (coincides with standard), full relational, and half 
relational processes] 

B & W Phrase Results 
Figure 16 

                        NUMBER OF PHRASE MATCHES 

                 WITH REL         WITH NONREL 
                 DOCUMENTS        DOCUMENTS         TOTAL 

HALF 
RELATIONAL        62   38%        102   62%          164 

FULL 
RELATIONAL        36   69%         16   31%           52 

Phrase Matches (B & W) 
Figure 17 

6. Conclusion 

The general conclusions that can be drawn from these experiments are 
that a number of different types of phrase processes are useful in 
information retrieval provided certain characteristics exist in the 
document set. This is especially true in the case of structural phrases 
where it appears that effective phrase use depends more on the collection 
than on the specific phrase process. 

The implicit phrase process is designed to boost correlations based 
on the cooccurrence of many concepts in the document and query as opposed 
to those correlations which are the result of a very few matches of highly 
weighted concepts. Results indicate that it performs the job quite well. 
However, if the collection has relatively few high weights, the need for 
implicit phrases no longer exists. Using implicit phrases with such 
collections is thus a wasted effort and may even lead to downgraded 
retrieval. 

For structured phrases to be of value in information retrieval, a 
number of conditions must be met. First, the collection must be sufficiently 
dense, or at least have some dense clumps of documents. Second, the 
documents must contain nonreversible relations. Along the same line, the 
documents in any particular clump must be sufficiently different 
semantically so that conceivably some but not all could be relevant to a 
given query. In other words, there must be a potential need to discriminate 
between closely related documents. This restriction is necessary for the 
following reason. It is conceivable that a particular clump of documents 
could be so closely related that either all or none are related to any 
query. While this clump satisfies the density requirement and may have 
nonreversible relations as well, it does not require the use of phrases. 
There is no need to distinguish among members of the clump and thus phrases 
cannot help. Finally, it is necessary that the queries contain nonreversible 
relations. If such relations are not requested in the query, as is true in 
the ADI collection, no advantage is gained by using them in the documents. 
Testing this condition is easy when dealing with experimental documents and 
queries, but clearly impossible in real applications. However, it is 
possible to predict the general form for expected queries and thereby 
determine if they meet the phrase requirement. As a general guideline, 
queries are more applicable to phrase use if they are of the 
question-answering variety rather than pure document retrieval. 

The final conclusion that is reached from this study is that,
contrary to intuition, phrases do not seem to exert a large effect on a
user's choice of relevant documents. Future work must be done on determining
the factors that go into a user's relevancy decisions. With more insight
into this area, the role of structure in information retrieval will become
much more clearly defined.



References 



[1] Curtice, R. M., and Jones, P. E., An Operational Interactive
Retrieval System, Arthur D. Little, Inc., 1969.

[2] Douglas, E., Mandersloot, W., and Spicer, N., Thesaurus Control -
the Selection, Grouping, and Cross-referencing of Terms for
Inclusion in a Coordinate Index Word List, Journal of the American
Society for Information Science, January-February 1970.

[3] Hutchins, W. J., Automatic Document Selection Without Indexing,
Journal of Documentation, Vol. 23, No. 4, December 1967.

[4] IBM Systems/360 Document Processing System, Applications
Description, IBM, 1967.

[5] Lesk, M. E., A Proposal for English Text Analysis, Bell Telephone
Laboratories, 1969.

[6] Salton, G., Automatic Information Organization and Retrieval,
McGraw-Hill, New York, 1968.

[7] Salton, G., Automatic Text Analysis, Science, Vol. 168, April
17, 1970.

[8] Simmons, R. F., Synthex, In Orr, W. D. (Ed.), Conversational
Computers, John Wiley and Sons, Inc., New York, 1968.

[9] Weiss, S. F., A Template Approach to Natural Language Analysis
for Information Retrieval, Ph.D. Thesis, Cornell University,
Ithaca, New York, 1970.









II. The "Generality" Effect and the Retrieval 
Evaluation for Large Collections 

G. Salton 



Abstract 

The retrieval effectiveness of large document collections is
normally assessed by using small subsections of the file for test purposes,
and extrapolating the data upward to represent the results for the full
collection. The accuracy of such an extrapolation unhappily depends on
the "generality" of the respective collections.

In the present study the role of the generality effect in
retrieval system evaluation is assessed, and evaluation results are
given for the comparison of several document collections of distinct size
and generality in the areas of documentation and aerodynamics.

1. Introduction 

Over the past few years a great many studies have been undertaken
in an attempt to assess the retrieval effectiveness of a variety of
automatic analysis and search procedures. Under normal circumstances,
a single test collection is used which is subjected to a variety of pro-
cessing methods; paired comparisons are then made between two or more
procedures for this collection in order to determine which methods are
most effective in a retrieval environment. [1,2,3]

Occasionally, however, it is necessary to use several different
document collections in a test situation and to compare the results for
distinct collections (rather than for distinct processing methods). Such
is the case notably when a variable is tested for which a single collection
is not normally usable (for example, the language in which the documents are
written [4]), or when an attempt is made to extrapolate from a small test
collection to a large operational one. [5] In such situations, special
precautions are needed to insure that the evaluation measures actually reflect
the performance differences between the respective collections.

Consider as an example, two distinct document collections. Performance 
differences might then emerge as a result of the following collection 
characteristics: 



    a) differences in subject matter;

    b) differences in the scope of the collections;

    c) differences in the document types available for processing;

    d) differences in query types;

    e) differences in the collection size;

and f) differences in the relevance judgments of queries with respect
       to documents.



In the present study, the first four variables are not under inves- 
tigation in the sense that comparisons are made only for collections of 
document abstracts of similar scope within a specific subject area, using 
standard user requests of the type often submitted to an information center. 
The other two variables, namely collection size and type of relevance
assessments, are of special interest, since both of them affect the evaluation
results obtained for large operational systems. These variables to a large
extent determine the generality of the collection, that is, the average
number of relevant items per query, and generality in turn affects the 
evaluation parameters. 

In the remainder of this study, two different generality problems 
are examined by using on the one hand collections of different size for 
which the relevance judgments agree, and, on the other hand, collections
of identical size with different relevance properties. The variations
obtained in the evaluation results are examined, and an attempt is made
to interpret the respective performance differences. 

2. Basic System Parameters

The evaluation parameters used to assess the retrieval performance
of a given set of user queries with respect to a document collection are
normally based on a two by two contingency table which distinguishes be-
tween the documents retrieved in answer to a given query and those not
retrieved, and between items judged to be relevant to the query and those
not relevant. A typical contingency table is presented in Table 1(a),
and four common evaluation measures derived from it are contained in
Table 1(b).

Each of the measures listed in Table 1 is initially defined for 
each query separately. However, procedures exist for averaging the 
measures over a complete query set and for suitably displaying the 
resulting values in the form of recall-precision, or recall-fallout graphs.
[6] These graphs are then expected to reflect the performance of an
entire system for a given set of users.

                     Relevant      Not Relevant

Retrieved               a               b             a+b

Not Retrieved           c               d             c+d

                       a+c             b+d          a+b+c+d

(a) Contingency Table

Symbol   Evaluation Measure   Formula             Explanation

R        Recall               a/(a+c)             proportion of relevant
                                                  actually retrieved

P        Precision            a/(a+b)             proportion of retrieved
                                                  actually relevant

F        Fallout              b/(b+d)             proportion of nonrelevant
                                                  actually retrieved

G        Generality           (a+c)/(a+b+c+d)     proportion of relevant
                                                  per query

(b) Principal Evaluation Measures

Retrieval Evaluation Measures

Table 1

It should be noted that the four retrieval measures are not
independent of each other. Specifically, three of the measures will auto-
matically determine the fourth. As an example, equation (1) can be
used to derive precision in terms of recall, fallout, and generality, as
follows:

$$ P = \frac{R \cdot G}{(R \cdot G) + F(1 - G)} \tag{1} $$
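
The dependence among the four measures can be checked numerically. The following is a minimal sketch, assuming a single query whose contingency counts are known; the class name `Contingency`, the function `precision_from_rfg`, and the counts used are illustrative assumptions and do not come from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    """Cells of the two-by-two contingency table for one query (hypothetical helper)."""
    a: int  # relevant and retrieved
    b: int  # nonrelevant and retrieved
    c: int  # relevant but not retrieved
    d: int  # nonrelevant and not retrieved

    def recall(self) -> float:
        return self.a / (self.a + self.c)

    def precision(self) -> float:
        return self.a / (self.a + self.b)

    def fallout(self) -> float:
        return self.b / (self.b + self.d)

    def generality(self) -> float:
        return (self.a + self.c) / (self.a + self.b + self.c + self.d)


def precision_from_rfg(recall: float, fallout: float, generality: float) -> float:
    """Equation (1): precision expressed through recall, fallout and generality."""
    rg = recall * generality
    return rg / (rg + fallout * (1.0 - generality))


# Made-up counts, chosen only to exercise the formulas.
table = Contingency(a=3, b=7, c=2, d=188)
derived = precision_from_rfg(table.recall(), table.fallout(), table.generality())
print(table.precision(), derived)
```

Both printed values agree, since R·G reduces to a/(a+b+c+d) and F(1-G) to b/(a+b+c+d). The sketch does not guard against empty rows or columns of the table, where the ratios are undefined.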



Most of the retrieval evaluation results published in the literature
have been presented in terms of recall and precision. Since recall provides
an indication of the proportion of relevant actually obtained as
a result of a search, while precision is a measure of the efficiency with
which these relevant are retrieved, a recall-precision output is
user-oriented, in the sense that the user is normally interested in optimizing
the retrieval of relevant items. On the other hand, fallout is a measure
of the efficiency of rejecting the nonrelevant items, and includes as a
factor the total number of nonrelevant in the collection (which in many
cases is approximately equivalent to the collection size). For this reason,
a recall-fallout display is normally considered to be systems-oriented
since it indicates how well the nonrelevant are rejected as a function of
collection size.

In view of their special orientation, it would then appear that
some of the measures are more appropriate in certain circumstances than
in others: in particular, if a systems viewpoint is important which
takes into account the amount of work devoted to the retrieval of non-
relevant items as well as the collection size, a fallout display may be
more desirable than a graph based on precision.

The situation is unfortunately complicated by the fact that the
various measures do not vary in the same manner when a comparison is made
of the performance of several distinct document collections. Consider, as
an example, the parameter variations produced by changes in collection
generality. As the generality increases, that is, as the average number
of relevant per query grows larger, the number of relevant retrieved may
also be expected to increase. In terms of the variables introduced in
Table 1, a and a+c may then be expected to grow directly with generality;
on the other hand a+b, and b+d (the total retrieved, and the total non-
relevant) remain relatively constant.

As G increases, the following picture then emerges for R, P,
and F, respectively:

$$ R = \frac{a \uparrow}{(a+c) \uparrow}, \qquad P = \frac{a \uparrow}{(a+b) \rightarrow}, \qquad F = \frac{b \rightarrow}{(b+d) \rightarrow} $$

where the upward arrow denotes an increasing quantity, and the horizontal
arrow a quantity more or less constant. Thus, R and F should remain
reasonably constant with changes in generality, since numerator and denom-
inator vary in the same direction. Precision, on the other hand, should
vary directly with generality because of the increasing numerator together
with the constant denominator.

This kind of argument has been used in the past to show that the
use of recall-precision graphs is generally undesirable [7], and to
claim that performance figures obtained with small sample collections in a
laboratory environment cannot be applied to large operational collections
[8]. This question is further examined in the next section.

3. Variations in Collection Size

A) Theoretical Considerations

Consider a performance comparison for two collections of different
size within a given subject area. Such collections generally exhibit
different generality characteristics, since the larger collection is
likely to contain on the average many more nonrelevant items per query,
and therefore proportionately many fewer relevant ones.

In going from the smaller (test) collection to the larger (opera-
tional) one, two limiting cases may be distinguished:

a) if the relevance of the documents added to the small
collection in order to produce the large one is difficult
to assess in a clear-cut way, and nonrelevant items that
are hard or easy to reject are added roughly in the same
proportion as originally present, then for a given level of
recall a larger number of relevant items will have to be
retrieved; this will imply the simultaneous retrieval of a
larger number of nonrelevant, thereby depressing precision,
but keeping fallout roughly constant;

b) on the other hand, if the documents added are clearly
extraneous to the query topics and the nonrelevant ones
are easily rejectable, the number of relevant and nonrele-
vant retrieved at a given recall level remains constant,
thereby producing a constant precision but lower fallout
for the larger collection.

The situation is summarized in Table 2.

Documents added to the small collection         Precision    Fallout

case 1: nonrelevant items added roughly in      P ↓          F →
        the proportion originally present

case 2: clearly extraneous, easily              P →          F ↓
        rejectable items added

Precision and Fallout Performance for Variations

Table 2

If case 2 were to occur in practice, that is, if one could insure
that any nonrelevant documents added to the small collection would be
easy to reject, then the standard recall-precision plot would furnish
a completely adequate evaluation tool, since the precision would then be
independent of the generality change, and would in fact be identical for
both collections at each common recall level. If, on the other hand,
case 1 is taken as typical, then fallout can be assumed to be constant.
This makes it possible to compute an "adjusted precision" value as a
function of generality, to account for the generality change in upgrading
from a small collection to a large one.

Consider, as an example, a document collection with generality $G_1$,
and a given precision $P_1$ at a recall level of $R_1$. If the size of the
collection is altered to a new generality $G_2$, then, for any given recall
level, equation (1) can be used to compute the adjusted precision $P_2$
for the larger collection. In fact, if the generality change is subject
to the rules of case 1, one has (from equation (1)):

$$ P_2(\text{adjusted}) = \frac{R_1 \cdot G_2}{(R_1 \cdot G_2) + F_1 (1 - G_2)} \tag{2} $$

where the computations are made for a given recall level $R_1$, and
fallout $F_1$ is assumed constant. Equation (2) then provides a means for com-
puting the precision transformation for the case where all factors other
than generality remain constant.

Cleverdon and Keen propose a three-step procedure for effecting the
precision transformation as follows: [1]

a) given $R_1$, $P_1$ and $G_1$, compute $F_1$;

b) assume $F_2 = F_1$;

c) given $R_1$, $G_2$ and $F_2$, compute $P_2$.

An example for a collection of generality 0.005 and recall and precision
values of 0.60 and 0.25 respectively is shown in Table 3. The precision
adjusted to a generality level of G = 0.002 is seen to be 0.11.
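
The three-step transformation can be sketched in a few lines. This is only an illustration under the constant-fallout assumption of case 1; the function names are hypothetical, and the first step simply solves equation (1) for fallout.

```python
def fallout_from_rpg(recall: float, precision: float, generality: float) -> float:
    """Step (a): equation (1) solved for fallout, F = R*G*(1-P) / (P*(1-G))."""
    return recall * generality * (1.0 - precision) / (precision * (1.0 - generality))


def adjusted_precision(recall: float, precision: float,
                       g_old: float, g_new: float) -> float:
    """Precision adjusted from generality g_old to g_new at a fixed recall level."""
    fallout = fallout_from_rpg(recall, precision, g_old)  # step (a)
    # Step (b): fallout is assumed to be unchanged in the new collection.
    rg = recall * g_new
    return rg / (rg + fallout * (1.0 - g_new))            # step (c), equation (1)


# Figures quoted in the text: G = 0.005, R = 0.60, P = 0.25, new G = 0.002.
p2 = adjusted_precision(0.60, 0.25, 0.005, 0.002)
print(round(p2, 4))
```

With these inputs the sketch gives an adjusted precision of roughly 0.117, close to the value of about 0.11 quoted in the text; the small difference presumably reflects rounding or truncation in the tabulated figures.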

B) Evaluation Results 

The theoretical considerations outlined in the last few paragraphs 
indicate that the retrieval evaluation provides an accurate picture for the 
case where the expansion in collection size is caused by the addition to a 
small document collection of clearly nonrelevant items which are easily 
rejectable, and for the case where fallout remains constant, that is,
where relevant and nonrelevant items are added in a proportion roughly 
equivalent to that which originally existed. 

Unfortunately, when the assumptions of cases 1 and 2 are tested on
actual document collections of different generality, they are found not
to hold in practice. For example, in a test conducted some years ago
with two document collections of 200 and 1400 documents in aerodynamics,
respectively, and a sample of 42 queries, Cleverdon and Keen found for a
specified cutoff and processing method that

"b (the nonrelevant retrieved) has increased by a factor of
5.2352 while the total number of nonrelevant documents in the
collection (b+d) has increased by a factor of 7.1443." [1, p. 74]

For the example considered, fallout therefore did not remain constant, 
and many of the nonrelevant included in the larger collection of 1400 
items obviously exhibited a lower probability of being retrieved than the 
nonrelevant included in the smaller subcollection. 
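
The size of the change in fallout follows directly from the two quoted factors. As a rough worked step, writing $F_{200} = b/(b+d)$ for the small collection and $F_{1400}$ for the large one (both symbols are introduced here only for illustration):

$$ F_{1400} = \frac{5.2352\, b}{7.1443\,(b+d)} \approx 0.73\, F_{200} $$

so that, at this cutoff, fallout drops by roughly a quarter instead of remaining constant.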
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To verify this result, the two collections originally used by 
Cleverdon were subjected to a complete retrieval test, using a set of 36 
queries with identical relevance properties in both collections (the set 
of relevant items was the same for each query in both collections). The 
collection characteristics are summarized in Table 4, and recall-precision,
as well as recall-fallout, plots are included in Fig. 1, averaged over 
the 36 test queries. 

It may be seen from the output of Table 4 and Fig. 1 that although 
the collection generality decreases by a factor of about seven in the transi- 
tion from small to large collection, the fallout decreases by a factor of 
only three on the average. Thus the proportion of nonrelevant retrieved is 
much smaller for the large collection than for the small one, producing the 
recall-fallout plot of Fig. 1(b) which favors the large collection (the
smaller the fallout, the better is the performance).* The recall-precision
plot, on the other hand, favors the small collection (the higher the pre- 
cision, the better is the performance), indicating that at a given recall 
level, fewer nonrelevant will have been retrieved for the small collection 
than for the large one. 

The data of Table 5, containing the average number of nonrelevant
documents retrieved at various recall levels, indicate that the seven-fold
decrease in collection generality is accompanied by an increase in the
average number of nonrelevant retrieved, ranging from a factor of 2 at a
recall of 0.1 to a factor of only 3.2 at a recall of 0.3 and 0.5. This
explains the superior systems-oriented performance of the large 1400
collection in comparison with the small one.



*The average number of nonrelevant items retrieved at various recall levels
is shown in Table 5 for the Cranfield 200 and 1400 collections.



Property               Cranfield 200            Cranfield 1400

Source                 Cranfield document       Cranfield document
                       abstracts in             abstracts in
                       aerodynamics             aerodynamics

Document Analysis      Word stem process        Word stem process

Number of Documents    200                      1400

Number of Queries      36                       36

Number of Relevant     160                      160
Documents

Type of Search         Full search              Full search

Generality             .0222                    .0031

Average Fallout        .0248                    .0081

Collection Properties for Cranfield 200 and 1400
Table 4
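
The generality figures of Table 4 can be reproduced from the other table entries. The sketch below assumes that the 160 relevant documents are counted over all 36 queries together, an interpretation that is not stated explicitly but that matches the tabulated values; the function name `generality` is illustrative.

```python
def generality(total_relevant: int, num_queries: int, num_documents: int) -> float:
    """Average number of relevant items per query, as a fraction of the collection."""
    return (total_relevant / num_queries) / num_documents


g_200 = generality(160, 36, 200)     # Cranfield 200
g_1400 = generality(160, 36, 1400)   # Cranfield 1400

generality_ratio = g_200 / g_1400    # equals the ratio of collection sizes
fallout_ratio = 0.0248 / 0.0081      # average fallout values copied from Table 4

print(round(g_200, 4), round(g_1400, 4))
print(round(generality_ratio, 2), round(fallout_ratio, 2))
```

Under this reading the generality falls by a factor of seven while the average fallout falls by a factor of about three, which is the comparison drawn in the text.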



            Average Number of Nonrelevant Retrieved     Factor of Increase
Recall      Cranfield 200          Cranfield 1400       from 200 to 1400

0.1            0.33                   0.67                  2

0.3            1.35                   4.32                  3.2

0.5            2.79                   8.82                  3.2

0.7            6.21                  16.15                  2.6

0.9           13.89                  30.54                  2.2

Increase in Nonrelevant Retrieved
from Cranfield 200 to Cranfield 1400

Table 5
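
The last column of Table 5 is simply the ratio of the two preceding columns. A short sketch (variable names are illustrative; the numbers are copied from the table) recomputes it and compares it with the sevenfold growth in collection size:

```python
# recall level -> (average nonrelevant retrieved in Cranfield 200, in Cranfield 1400)
table5 = {
    0.1: (0.33, 0.67),
    0.3: (1.35, 4.32),
    0.5: (2.79, 8.82),
    0.7: (6.21, 16.15),
    0.9: (13.89, 30.54),
}

# Factor of increase at each recall level, rounded to one decimal place.
factors = {recall: round(large / small, 1)
           for recall, (small, large) in table5.items()}

size_factor = 1400 / 200  # growth in collection size

for recall, factor in factors.items():
    print(recall, factor, factor < size_factor)
```

At every recall level the number of nonrelevant retrieved grows by a factor well below seven, which is why fallout is lower for the larger collection.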
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In practice, it is seen that the larger the collection (and therefore 
the smaller the generality), the larger will be the number of nonrelevant
items which will have been retrieved at any given recall level;
however the resulting decrease in precision performance is much smaller
than expected by the factor of increase in collection size and nonrelevant
items added. Neither of the two simple generality transformations
discussed in the preceding subsection appears to be applicable in practice,
since both precision and fallout may be expected to decrease with a
decrease in collection generality.*

C) Feedback Performance 

It is known that interactive search methods in which the user 
influences the retrieval process by providing appropriate feedback infor- 
mation during the course of the operations can be used profitably in a 
retrieval environment. [10,11] In fact, some of the feedback methods 
which have been tested over the last few years, including, in particular, 
the relevance feedback process regularly used with the automatic SMART
document retrieval system, provide anywhere from five to twenty percent
improvement in precision at a given recall level. Most other refinements
in retrieval methodology (such as, for example, a particularly
sophisticated language analysis scheme) may bring improvements in per-
formance of the order of a few percent at best.

*If the precision transformation of equation (1) were (incorrectly) to be
applied to the precision performance of the small collection to reduce its
generality to that of the large collection (.0031), the adjusted precision
curve of Fig. 2 would result. This adjusted precision is an inverse function
of fallout, which accounts for its inferior performance compared with that
of the large collection.

Recall-Precision Plot for Cran 200 and Cran 1400 Collections
(Precision Adjusted to Generality of .0031)
(averages over 36 queries)

Fig. 2

The relevance feedback process utilizes user relevance judgments
for documents previously retrieved by an initial search in order to
construct an improved query formulation which can subsequently be used in
a new "first iteration", or "second iteration" search. Specifically, an
initial search is performed for each request received, and a small amount
of output, consisting of some of the highest scoring documents, is pre-
sented to the user. Some of the retrieved output is then examined by the
user who identifies each document as being either relevant (R) or not rele-
vant (N) to his purpose. These relevance judgments are later returned to
the system, and used automatically to adjust the initial search request in
such a way that query terms present in the relevant documents are promoted
(by increasing their weight), whereas terms occurring in the documents
designated as nonrelevant are similarly demoted. This process produces
an altered search request which may be expected to exhibit greater simi-
larity with the relevant document subset, and greater dissimilarity with
the nonrelevant set.

The altered request can next be submitted to the system, and a
second search can be performed using the new request formulation. If the
system performs as expected, additional relevant material may then be
retrieved, or, in any case, the relevant items may produce a greater
similarity with the altered request than with the original. The newly
retrieved items can again be examined by the user, and new relevance
assessments can be used to obtain a second reformulation of the request.
This process can be continued over several iterations, until such time
as the user is satisfied with the results obtained.
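
One feedback iteration of the kind described above might be sketched as follows. This is a simplified illustration and not the SMART implementation: documents and queries are represented as term-weight dictionaries, every term of a judged document is added to or subtracted from the query with an equal step, and terms whose weight is no longer positive are dropped. All of these choices, and the names `feedback_query` and `cosine`, are assumptions made for the example.

```python
import math


def cosine(u: dict, v: dict) -> float:
    """Cosine correlation between two sparse term-weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def feedback_query(query: dict, relevant_docs: list, nonrelevant_docs: list,
                   step: float = 1.0) -> dict:
    """Promote terms of documents judged relevant, demote terms of nonrelevant ones."""
    altered = dict(query)
    for doc in relevant_docs:
        for term, weight in doc.items():
            altered[term] = altered.get(term, 0.0) + step * weight
    for doc in nonrelevant_docs:
        for term, weight in doc.items():
            altered[term] = altered.get(term, 0.0) - step * weight
    # Assumption: terms with zero or negative weight are removed from the request.
    return {term: w for term, w in altered.items() if w > 0}


# Toy vectors, invented for the example.
query = {"wing": 1.0, "flow": 1.0}
relevant = {"wing": 1.0, "boundary": 1.0, "layer": 1.0}
nonrelevant = {"flow": 1.0, "chemical": 1.0}

altered = feedback_query(query, [relevant], [nonrelevant])
print(cosine(query, relevant), cosine(altered, relevant))
print(cosine(query, nonrelevant), cosine(altered, nonrelevant))
```

In this toy case the altered request correlates more strongly with the relevant document and less strongly with the nonrelevant one, which is the behavior the feedback process is expected to produce.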
In order to determine whether the relevance feedback process is
usable with large document collections in an operational environment, the
feedback procedure was tested using two collections in aerodynamics of
different generality. [12] If comparable feedback improvements could be
obtained for collections of varying size and generality, then it would appear
reasonable to conclude that the feedback process will be valuable under
operational conditions.

The two collections being tested consist of 200 and 424 document 
abstracts in aerodynamics, respectively, together with 22 queries with 
identical relevance properties in both collections. The collection char-
acteristics are summarized in Table 6, and the recall-precision and recall-
fallout graphs obtained with a "positive" feedback strategy are shown for
both collections in Figs. 3 and 4.

It may be noted that once again the recall-precision output favors
the small collection, whereas the recall-fallout output is more favorable
to the larger collection. Furthermore, while the generality decreases by
a factor of over 2 from small to large collection, the fallout drops by
less than one-half. These results are entirely in agreement with those
previously obtained for the Cranfield 1400 collection. The output of
Figs. 3 and 4 for the positive feedback strategy also indicates that the
magnitude of improvement provided by one feedback iteration is approximately
comparable for the two collections.

In order to investigate the question of feedback improvement in
more detail, several feedback procedures were tested including, in parti-
cular, the following three types (based on the retrieval of the top five
documents in each case):
Property            Cranfield 200          Cranfield 424

Source              Abstracts in           Abstracts in
                    aerodynamics           aerodynamics

Analysis            Word stem process      Word stem process

No. Documents       200                    424

No. Queries         22                     22

No. of Relevant     115                    115

Search              Feedback search        Feedback search

Generality          .0261                  .0123

Ave. Fallout        .0333                  .0211

Collection Properties for Feedback Searches
Using Cranfield 200 and 424

Table 6

Recall-Precision Comparison for Cran 200 and 424 Collections
(initial run and one feedback iteration, positive
feedback only, word stem process, 22 queries)

Fig. 3

Recall-Fallout Graph for Cran 200 and 424
(initial run and one feedback iteration,
positive feedback only)

Fig. 4
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a) positive feedback, where information obtained from docu- 
ments known to be relevant is used to update the query 
formulation;

b) selective negative feedback, where positive information is 
derived from the relevant documents together with negative 
information obtained from the top retrieved nonrelevant item; 

c) modified selective negative feedback, where the negative
information derived from the nonrelevant documents is used 
only when no positive information is available. 
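
The difference among the three strategies lies in which of the top retrieved documents are fed back. The sketch below is one possible reading of the descriptions above; in particular, it assumes that the modified strategy, like the selective one, uses only the top-ranked nonrelevant item. The function name and strategy labels are hypothetical.

```python
def select_feedback(top_ranked: list, relevant_ids: set, strategy: str) -> tuple:
    """Return (positive, negative) document ids chosen from the top retrieved items.

    top_ranked   -- document ids in rank order (for example the top five)
    relevant_ids -- ids judged relevant by the user
    """
    positive = [doc for doc in top_ranked if doc in relevant_ids]
    nonrelevant = [doc for doc in top_ranked if doc not in relevant_ids]
    top_nonrelevant = nonrelevant[:1]  # highest-ranked nonrelevant item, if any

    if strategy == "positive":
        return positive, []
    if strategy == "selective_negative":
        return positive, top_nonrelevant
    if strategy == "modified_selective_negative":
        # Negative information is used only when nothing relevant was retrieved.
        return positive, ([] if positive else top_nonrelevant)
    raise ValueError(f"unknown strategy: {strategy}")


top_five = ["d3", "d9", "d1", "d7", "d4"]
print(select_feedback(top_five, {"d1", "d4"}, "selective_negative"))
print(select_feedback(top_five, set(), "modified_selective_negative"))
```

The selected identifiers would then be passed to whatever query alteration step is in use; only the selection logic is shown here.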

The evaluation is based principally on two evaluation functions,
which measure respectively the precision improvement and the fallout
improvement as follows: [12]

$$ \text{Precision improvement} = P_1 - P_0, $$

where $P_0$ is the precision of the initial search, and $P_1$ is the precision
of the feedback iteration at a specified fixed recall point; and

$$ \text{Fallout improvement} = F_0 - F_1, $$

where $F_0$ is initial fallout, and $F_1$ the fallout of the feedback iteration.
(A performance improvement implies that the fallout for the feedback iteration
is smaller than the initial fallout.)

The output for a selective negative feedback strategy which does
not operate satisfactorily in an environment of decreasing generality is
shown in Fig. 5. It is seen that for the larger collection the precision
improvement is negative for most recall points, showing that the feedback
process in fact hurts the performance. The same is true for some points of
the fallout improvement curve. Apparently, the strategy represented by
the curves of Fig. 5 uses too many nonrelevant items for feedback purposes
thereby hurting retrieval. (Fewer relevant items are retrieved early in the
search for the Cran 424 collection, than for Cran 200.)

[Fig. 5: precision improvement and fallout improvement curves for the
selective negative feedback strategy, Cran 200 and Cran 424 collections]

The performance for two feedback strategies which operate excellently 
with decreases in generality is shown by the precision and fallout improve- 
ment curves of Figs. 6 and 7. Fig. 6 covers the positive-feedback strategy 
which is seen to operate equally well for both collections. Still larger 
improvements are noted in Fig. 7 for the modified negative strategy in
which a nonrelevant item is used for feedback purposes only when positive
information (in the form of relevant retrieved documents) is not available.

From the output of Figs. 6 and 7 it appears that feedback strategies
can be implemented which operate equally well for collections of low and
high generality. These strategies should be implementable in a realistic
environment comprising thousands of items where they may be expected to
produce the performance improvements previously noted for small test collec-
tions.

4. Variations in Relevance Judgments

A generality problem arises not only when collections of different 
size but identical relevance properties are to be compared, but also when 
the same collection is processed with different types of relevance assess- 
ments. In a previous study, a collection of 1268 documents in library 
science and documentation was examined using four types of relevance grades: 

a) the A judgments representing relevance assessments by the 
query authors; 

b) the B judgments representing nonauthor judges;

c) the C judgments representing the disjunction between the
A and B judgments (that is, a document is judged relevant
to a query if either A or B judges termed it relevant);

d) the D judgments representing the conjunction between A and
B judgments (a document is judged relevant if both A and B
judges termed it relevant).
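
In set terms, the C judgments are the union and the D judgments the intersection of the A and B judgments. A minimal sketch, assuming each set of judgments is stored as (query, document) pairs; the pairs below are invented for illustration:

```python
def combined_judgments(a_judgments: set, b_judgments: set) -> tuple:
    """Return (C, D): the disjunction and the conjunction of the A and B judgments."""
    c_judgments = a_judgments | b_judgments  # relevant if either judge says so
    d_judgments = a_judgments & b_judgments  # relevant only if both judges agree
    return c_judgments, d_judgments


# Invented (query, document) pairs judged relevant by each group.
a_rel = {("q1", "d1"), ("q1", "d2"), ("q2", "d5")}
b_rel = {("q1", "d2"), ("q1", "d3"), ("q2", "d5")}

c_rel, d_rel = combined_judgments(a_rel, b_rel)
print(len(c_rel), len(d_rel))
```

Since D is always contained in C, the D judgments can never yield more relevant items per query than the C judgments, and so never a higher generality for the same collection.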

It was demonstrated in the previous study [13], that the recall-precision
performance graphs are relatively invariant to the variations caused by
the multiple relevance assessments, and by the resulting changes in
generality.

In an attempt to determine whether the performance characteristics 
obtained with collections of different size can be related to those pro-
duced by collections with varying relevance properties, the C and D
collections are processed once again under slightly modified conditions.

The collection properties are outlined in Table 7. 

It will be noted that in the present case the generality change is 
produced not by adding any documents to the C collection in order to obtain 
the other collection of lower generality, but rather by subtracting from 
the set of relevant documents a number of items about which a unanimity 
of opinion could not be obtained by the relevance assessors. Nevertheless, 
the performance figures given in Table 7, and in Fig. 8(a) show that 
once again somewhat better recall-precision data for the collection of 
high generality (the C collection) are coupled with somewhat better 
fallout data for the collection of low generality (the D collection).*
This reflects the fact, on the one hand, that precision varies somewhat
with generality, and therefore the collection with higher generality is likely
to produce better precision. On the other hand, the collection of low
generality exhibits better relevance judgments, since at least two judges
had to agree on the relevance of each document; there exists therefore
a greater certainty about the relevance (or nonrelevance) of each document
with respect to each query, which implies that the nonrelevant are easier
to reject using the D relevance judgments.

*The recall-precision figures shown in Fig. 8(a) are not directly comparable
to those produced in the earlier study [13] because of a small difference in
the method used to produce performance averages over the total number of queries.

Property            Ispra C                Ispra D

Source              Document abstracts     Document abstracts
                    in documentation       in documentation

Analysis            Thesaurus              Thesaurus

No. Documents       1268                   1268

No. Queries         45                     45

No. of Relevant     1260                   306

Search              Full search            Full search

Generality          .0241                  .0058

Average Fallout     .1409                  .0819

Collection Properties for Ispra C and D Collections

Table 7

In order to see how the performance data change under a generality 
transformation, the C collection with high generality (.0241) is reduced 
to the generality of the D collection (.0058) in two different ways: 



a) collection C mod 1 is produced by taking 962 relevant docu-
ments chosen at random and calling them nonrelevant; this
reduces the original set of 1260 relevant documents in C
to a total of 306 relevant (equal to the number of relevant
in D);

b) collection C mod 2 is produced by retaining 306 out of the
1260 originally relevant items; the remaining 962 formerly
relevant items are assigned random ranks in the collection
(instead of being retained with the rank they initially
possessed as in C mod 1).*

The performance of the modified C collections which now exhibit
the same generality as the standard D is presented in the recall-precision
graphs of Fig. 8(b). It is seen that when the generality is kept invariant,
as it is for the three collections of Fig. 8(b), the collection with the
most reliable relevance judgments (the standard D) produces the best
performance. Of the two modified C collections obtained by the generality
transformation, the second produces better output than the first, since it
is more carefully constructed by randomly deleting relevant items, and
then randomly reintroducing them as nonrelevant ones with new ranks.

*The reranking process followed is described in a note by Williamson [14].
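The two generality transformations can be illustrated with a minimal sketch for the ranked output of a single query. The function names, the use of Python's `random` module, and the toy data are assumptions made for illustration only; the reranking procedure actually used is the one described in the note by Williamson [14].

```python
import random

def c_mod_1(ranking, relevant, keep, rng):
    """Keep `keep` randomly chosen relevant items and relabel the rest as
    nonrelevant. Every item retains its original rank."""
    kept = set(rng.sample(sorted(relevant), keep))
    return list(ranking), kept

def c_mod_2(ranking, relevant, keep, rng):
    """Same relabeling, but each demoted item is removed from the ranking
    and reinserted at a randomly chosen rank."""
    kept = set(rng.sample(sorted(relevant), keep))
    demoted = set(relevant) - kept
    new_ranking = [d for d in ranking if d not in demoted]
    for d in sorted(demoted):
        new_ranking.insert(rng.randrange(len(new_ranking) + 1), d)
    return new_ranking, kept

def generality(n_relevant, n_documents):
    """Proportion of the collection that is relevant."""
    return n_relevant / n_documents

# Hypothetical toy data: 20 ranked items, 6 of them originally relevant.
rng = random.Random(0)
ranking = list(range(20))
relevant = {0, 2, 3, 7, 11, 15}
ranking_1, kept_1 = c_mod_1(ranking, relevant, 2, rng)
ranking_2, kept_2 = c_mod_2(ranking, relevant, 2, rng)
```

Both transformations lower the generality by the same amount; they differ only in whether the formerly relevant items stay near the top of the ranking.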



5. Summary

A variety of retrieval tests were performed with collections of 
varying generality in the areas of aerodynamics and documentation. Since 
precision varies with generality, the precision output generally favors 
the (small) collection of high generality. However, as the generality 
drops by a factor of k, the precision drops by a much smaller factor, and 
the fallout, which had been thought to remain invariant with generality
changes, in fact decreases with generality, and thus favors the (large)
low generality collections. 

No clear extrapolation appears possible at this time which would 
permit a prediction to be made about the likely performance of very large 
collections of several hundred thousand items. However, the fallout
data obtained in this study make it clear that an argument which claims
that the retrieval of 20 nonrelevant items for a collection of 1000 items
would necessarily lead to an expected retrieval of 20,000 nonrelevant for
a collection of a million is fallacious, since it assumes a constant
fallout performance.
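The arithmetic behind this argument can be made explicit with a short sketch. Fallout is taken as the proportion of nonrelevant items retrieved; for simplicity the number of relevant items is neglected, so that the collection size stands in for the number of nonrelevant items. The reduced fallout value used for the large collection is purely hypothetical and is not a measured result.

```python
def fallout(nonrelevant_retrieved, nonrelevant_total):
    """Proportion of the nonrelevant items that are retrieved."""
    return nonrelevant_retrieved / nonrelevant_total

def expected_nonrelevant_retrieved(fallout_value, nonrelevant_total):
    """Number of nonrelevant items retrieved at a given fallout."""
    return fallout_value * nonrelevant_total

# 20 nonrelevant retrieved out of (approximately) 1000 nonrelevant items.
small_fallout = fallout(20, 1000)

# The fallacious extrapolation: the same fallout applied to a million items.
constant_case = expected_nonrelevant_retrieved(small_fallout, 1_000_000)

# Hypothetical: fallout in the large collection is one quarter of the
# small-collection value (an arbitrary factor chosen for illustration).
hypothetical_large_fallout = small_fallout / 4
lower_case = expected_nonrelevant_retrieved(hypothetical_large_fallout, 1_000_000)
```

Any decrease of fallout with generality reduces the projected number of nonrelevant items proportionally, which is why the constant-fallout projection overstates the expected output.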

The user feedback procedures appear to be useful for collections
of varying generality, and they should be implemented in operational
environments. Finally, when generality variations arise from inconsistencies
in the relevance assessments, the collection with the most secure relevance
data performs best.

As larger document collections come into experimental use, the
fallout and precision figures should continue to be compared with the
generality variations. In this fashion, it may be possible, in time,
to obtain reliable projections for the performance with large collections
under operational conditions.



III. Automatic Indexing Using Bibliographic Citations

G. Salton 



Abstract 

Bibliographic citations attached to technical documents have been 
used variously to refer to related items in the literature, to confer 
importance to a given piece of writing, and to serve as supplementary 
indications of document content. In the present study, citations are
used directly to identify document content, and an attempt is made to 
evaluate their effectiveness in a retrieval environment. It is shown 
that the use of bibliographic citations in addition to the normal keyword- 
type indicators produces improved retrieval performance, and that in some 
circumstances, citations are more effective for retrieval purposes than 
other more conventional terms and concepts. 

1. Significance of Bibliographic Citations 

The role of bibliographic citations attached to scientific and 
technical documents has received intensive study for many years. Several 
authors have noted, in particular, that the number of incoming citations
(that is, the number of citations from a given set of outside documents
to a specified target document) constitutes a useful indicator of document
type and importance [1,2]. In consequence, the so-called "bibliographic
network" consisting of documents and citations between them has been used
to assess the characteristics of scientific and technical communications [3].

In addition to providing indications of document influence,
bibliographic citations also play a role as content identifiers. The
close affinity between the citations attached to a given document and the 
normal keyword-type content indicators has been expressed by Garfield 
in the following terms [4]: 

"By using the author's references in compiling the citation index, 
we are in reality using an army of indexers, for every time an 
author makes a reference, he is in effect indexing that document 
from his point of view...."

Furthermore, only a very small proportion of documents appears to be totally
disconnected from the bibliographic network, in the sense that these docu- 
ments do not cite any other documents nor are they cited from the outside [3]: 

"...there is a lower bound of one percent of all papers that are 
totally disconnected in a pure citation network, and could
be found only by topic indexing...." 

As a result, search tools such as the "citation index", which lists all
incoming citations for each document in the index, have proved to be useful
adjuncts to information search and retrieval. 

A variety of studies have been undertaken in an attempt to deter- 
mine the relationship between standard keywords and bibliographic citations 
for content analysis purposes. Thus, it was determined that papers which 
were related by similarities in bibliographic citation patterns also pro- 
vided a large number of common subject identifiers. [5] Furthermore, the 
correlation between citation similarities on the one hand, and index term
similarities on the other is found to be far greater than expected for
random document sets [6].

While bibliographic citations appear not to have been used directly
as content indicators for retrieval purposes up to the present time, a
number of experiments have been performed in which citations were incorporated
as feedback information during the search process, in an attempt at retrieving
additional information similar to that being identified in the search [7,8].
Specifically, an initial search would be made, leading to the retrieval
of a number of documents. These would be scanned by the user, and information 
about these documents — including in particular document authors, citations 
made by the documents, and authors of these citations — would be returned to 
the system to be incorporated into an improved search formulation. The 
evaluation of this bibliographic feedback process proved, in particular, that
[8]:



"...no differences greater than four percent were found between 
the results of feeding back only subject data, and those of 
feeding back only bibliographic data. This implies that the 
usefulness of bibliographic data for feedback is of the same 
order as that of subject descriptors."

In addition, the same study showed that when citation data were added to 
standard subject indicators in a feedback environment, improvements of up 
to ten percent in retrieval effectiveness were obtained over and above 
the results produced by subject information alone. This led to the
conjecture that [8]:

"Since the bibliographic information is useful for feedback
purposes, it should also prove valuable for initial retrieval
searches."




An attempt is made in the remainder of this study to evaluate the 
correctness of this statement. Specifically, a collection of 200 documents 
in the field of aerodynamics is processed against a set of 42 queries using
first the normal content analysis methods incorporated into the automatic SMART
document retrieval system [9], and then a modified process based on the
bibliographic citations attached to the documents. The test design and evalua- 
tion results are covered in the remaining sections of this report. 



2. The Citation Test



Consider a given document collection available in the form of English
language abstracts, together with a corresponding set of user queries. Given
such a collection, various linguistic analysis procedures may serve to reduce
each item into analyzed vector form. A concept vector, representing either a
document or a query, normally consists of a set of terms, or concepts, together
with the respective concept weights. Two of the content analysis methods most
frequently used with the SMART retrieval system are the word stem and the
thesaurus processes. In a word stem analysis, each concept incorporated into
a normal concept vector represents a word stem extracted from the document,
whereas for the thesaurus procedure, the concepts represent thesaurus categories
obtained by consulting an automatic dictionary during the analysis operations.
Word stems, or thesaurus categories, are then concepts somewhat similar to the
standard subject indicators normally assigned manually to queries and documents.
In such an environment, the normal retrieval operation would consist in matching
the concept vectors for queries and documents, and in retrieving for the users'
attention all documents whose vectors exhibit a reasonable degree of similarity
with the corresponding query vectors.
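The matching of concept vectors may be illustrated by a minimal sketch. The text does not specify the similarity measure at this point; a cosine correlation between weighted concept vectors is assumed here, and the concept names, weights, and threshold are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine correlation between two sparse concept vectors
    (dictionaries mapping concept -> weight)."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve(query, documents, threshold):
    """Return document identifiers in decreasing order of similarity,
    keeping only those at or above the threshold."""
    scored = sorted(
        ((cosine(query, vec), doc_id) for doc_id, vec in documents.items()),
        reverse=True,
    )
    return [doc_id for score, doc_id in scored if score >= threshold]

# Hypothetical concept vectors (word stems or thesaurus categories).
documents = {
    "d1": {"flow": 2.0, "shock": 1.0},
    "d2": {"index": 1.0},
}
query = {"flow": 1.0}
retrieved = retrieve(query, documents, threshold=0.1)
```

The threshold plays the role of the "reasonable degree of similarity" mentioned above; in practice a ranked list may be presented instead of a fixed cutoff.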

If it is assumed that each document carries with it a set of bibliographic
citations (either to or from the document), it is possible to add to the normal
document concept vectors suitably chosen codes representing the bibliographic
citations; alternatively, the citation codes might replace the normal concepts.

In order to obtain a match between citation codes attached to documents
and normal user queries, it becomes necessary to attach citation informa- 
tion also to the queries. This can be done in one of two ways: 

a) some queries may have been formulated by the user population
in response to a set of documents known in advance to be
relevant; that is, for each query one or more source
documents exist, and the user's query is designed to
retrieve additional items similar to the respective source
documents*;

b) alternatively, a source document does not exist in advance,
but the user is able to designate some other document as
likely to be relevant to his query.

In either case, it becomes possible to add to the query vectors citation 
codes corresponding to source document citations, or to citations attached 
to the designated relevant documents, as the case may be. 

These operations then produce expanded query and document vectors
consisting partly of standard concept codes, and partly of citation codes,
as shown schematically in Figure 1. Three types of retrieval operations
become possible:

a) using only standard subject identifiers (the 'x' concepts
of Figure 1);

b) using only citation concepts (the 'y' concepts of Figure 1);

c) using both the standard and the citation concepts (the 'x'
and 'y' information).

In these circumstances, the relative value of the citation information may
be ascertained by comparing the results obtained with these three types of
concept vectors.

*In a previous test in which original query formulations were replaced
by source document vectors, it was shown that the retrieval effectiveness
produced by the source document "queries" was substantially better
than that obtained with the standard queries [10].

x x x x x x x x x x x x x x     y y y y y y
normal document concepts        bibliographic citations

a) Typical Expanded Document Vector

x x x x x x                     y y y y y y
normal query concepts           citations to source document

b) Typical Expanded Query Vector

Expanded Query and Document Vectors

Figure 1
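The three retrieval operations listed above can be sketched as follows. Concepts are tagged 'x' (standard) or 'y' (citation) as in Figure 1; the unit weight given to each citation code, the cosine measure, and all names are assumptions made for illustration. The two citation codes are taken from the appendix and are treated as opaque strings.

```python
import math

def expand(concepts, citation_codes):
    """Build an expanded vector: ('x', concept) entries for standard
    concepts and ('y', code) entries for citation codes (weight 1.0)."""
    vector = {("x", c): w for c, w in concepts.items()}
    vector.update({("y", code): 1.0 for code in citation_codes})
    return vector

def restrict(vector, kinds):
    """Keep only the entries whose tag is in `kinds`."""
    return {key: w for key, w in vector.items() if key[0] in kinds}

def cosine(u, v):
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

MODES = {"standard": {"x"}, "citations": {"y"}, "both": {"x", "y"}}

def score(query, document, mode):
    """Similarity under one of the three retrieval operations."""
    kinds = MODES[mode]
    return cosine(restrict(query, kinds), restrict(document, kinds))

query = expand({"flow": 1.0}, ["SINOJAS27076760"])
document = expand({"flow": 1.0, "shock": 1.0},
                  ["SINOJAS27076760", "CHEPHTF00HYPE61"])
scores = {mode: score(query, document, mode) for mode in MODES}
```

Comparing the rankings produced under the three modes for the same queries is then what allows the contribution of the citation portion to be isolated.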

For the test under discussion, a collection of 200 document abstracts 
in aerodynamics was used with 42 search requests obtained from research 
workers in aerodynamics (the Cranfield collection [11]). Each document carried 
an average of 18 bibliographic references (outgoing citations to other 
documents), and each query was originally formulated in response to a source 
document. The set of source documents was similar in nature to the standard
documents, in the sense that bibliographic citations were available for each; 
however, no source document was included among the standard 200. 

To generate the citation portion of the document and query vectors,
each citation was represented by a 15-character code. The citation coding 
is outlined in Figure 2, and some encoded sample documents are exhibited in 
the appendix. In order to increase the similarity coefficient for all 
documents cited by the query source documents, a citation code was added to 
each document vector not only for all outgoing citations, but also for each 
of the original documents. That is, each document is assumed implicitly to 
cite also itself (self-citation). A match between a query citation concept
and a document citation concept may then be due to one of two causes: 




a) a request citation (source document citation) is identical
with the document itself (request cites document);

b) a request citation is identical with a citation from a 
document (request and document have a common citation). 
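The self-citation convention and the two causes of a citation match can be sketched as follows. Citation codes are treated as opaque strings; the function names and the short placeholder codes are hypothetical, and the actual 15-character coding is the one outlined in Figure 2.

```python
def document_citation_codes(own_code, outgoing_codes):
    """Citation portion of a document vector: the outgoing citations
    plus the document's own code (implicit self-citation)."""
    return {own_code} | set(outgoing_codes)

def match_causes(query_codes, own_code, outgoing_codes):
    """For each query citation that matches the document, report which
    of the two causes produced the match."""
    outgoing = set(outgoing_codes)
    causes = []
    for code in sorted(set(query_codes)):
        if code == own_code:
            causes.append((code, "request cites document"))
        elif code in outgoing:
            causes.append((code, "common citation"))
    return causes

# Hypothetical placeholder codes.
causes = match_causes(["A", "B", "C"], own_code="A", outgoing_codes=["B", "D"])
```

Without the self-citation, a document cited directly by the source document would contribute no citation match unless it also happened to share a reference with it.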

A comparison between citation effectiveness and standard concepts
is obtained as usual by computing recall and precision* values for the
various runs while comparing the output. The performance results are 
described in the remaining sections of this study. 

3. Evaluation Results 

The computation of recall and precision results depends on the 
availability of relevance assessments stating the relevance characteristics
of each document with respect to each query. The original ("A") relevance 
assessments for the Cranfield collection were obtained by first submitting 
to the query authors for assessment the set of all documents cited by the 
source document, followed by additional items likely to be relevant. Since 
the source document citations were thus given special treatment , a bias may 
exist in favor of these citations — that is, an item cited by the source 
document may be more likely to be assessed as relevant than other extraneous
documents. For this reason, three additional sets of relevance judgments 
were independently obtained from nonauthor subject experts, for which all 
documents were treated equally; that is, no special identification was 
provided for source document citations. The characteristics of the four 
sets of relevance assessments are summarized in Table 1.** 

It may be seen that the four types of relevance assessments fall 
into two main categories as follows:



*Recall is the proportion of relevant documents retrieved, and precision is
the proportion of retrieved items actually relevant. Ideally one would like
to retrieve all relevant and reject all nonrelevant to produce recall and
precision values equal to 1. When recall is plotted against precision, as in
a standard recall-precision graph, curves close to the upper right-hand corner
represent superior performance, since both recall and precision are then maximized.

L% 

° ” available the Cranfield 



a) sets A and B have low generality characteristics (only four
to five relevant items per query), corresponding to a strict
interpretation of relevance; furthermore the A and B assessments
are very similar in nature in view of the overlap of
over 80 percent in the respective sets of relevant items
per query;

b) sets C and D exhibit much higher generality (almost 12 relevant
items per query), corresponding to a less narrow relevance
interpretation, and the similarity with the original A
judgments is much smaller.

Relevance Judgments        Generality                Percent Overlap
                           (Average Number of        with "A" Judgments
                           Relevant per Query)

Original Judgments "A"      4.70                     100.00%
"B" Judgments               4.28                      80.74%
"C" Judgments              11.94                      37.09%
"D" Judgments              11.70                      37.83%

Relevance Assessments

Table 1



Under normal circumstances, one would expect a better recall-precision 
performance for the high-generality case, while for equivalent generality, 
the best relevance assessments would produce the best performance [12].

The actual retrieval effect of the four types of relevance assessments is 
outlined in the graphs of Figure 3. 

It may be seen that when citations only are used in query and document
vectors (the 'y' portions), the low generality A and B assessments give
much superior performance (Figure 3(a)). On the other hand, when standard
thesaurus concepts are used in addition to citations, as in Figure 3(b),
the differences among the four types of assessments largely disappear. The
same is true when the thesaurus alone is used for analysis purposes (without
the additional citations). The latter results are in agreement with earlier
studies showing that only minor differences occur in averaged recall-precision
graphs with normal variations in relevance assessments [13]. The large
differences in the performance of the "citations only" run of Figure 3(a)
must then be due to the peculiar nature of relevance assessments 'A' and 'B',
and to the special treatment accorded to the source document citations during
the relevance judging procedure. For practical purposes, it appears safer
to use the 'C' and 'D' judgments in assessing the relative importance of
citation data and standard subject indicators in a retrieval environment.

The main output results are shown in Figure 4 for both 'A' and 'C'
relevance assessments. It may be seen that in both cases the augmented
thesaurus vectors, obtained by adding citation concepts to standard subject
indicators, improve the precision performance by up to ten percent for a
given recall point. The short "citations only" vectors provide superior
performance for the 'A' relevance assessments for the reasons already stated.
Even with the 'C' judgments, the citation indexing alone provides a very
high standard of performance in the low recall range.

The usefulness of bibliographic citations for content analysis 
purposes is further illustrated by the output of Figure 5 in which a standard 
word stem matching process is compared with the word stem vectors augmented 
by citation information. It can be seen from the output of Figure 5(a) that
the augmented stem vectors generally produce better performance than the 
standard word stems; this confirms the results obtained in Figure 4 for the 
thesaurus process. Furthermore, the output of Figure 5(b) shows that 
augmented thesaurus vectors are slightly preferable to augmented word stem
vectors.

The performance data of Figures 3 to 5 were obtained by adding source
document citations to the normal query formulations. Since the source 
documents exhibit especially strong relevance characteristics — each user 
knows in advance that the source documents are immediately germane to the 
information queries — an attempt was made to relax the requirement for 
source document citations by replacing them by the citations attached to a 
randomly chosen relevant document. 

Specifically, each query is first processed in the standard manner
using a normal thesaurus look-up procedure. A document identified as relevant
after the fact (but not known to the user in advance) is then used in lieu
of the normal source document, and citations from this relevant document are
used to form the augmented query vector. The relevant documents chosen for
this purpose are eliminated from the document collection for evaluation purposes.
The output of Figure 6 shows that the citations obtained from the randomly
chosen relevant documents do not have sufficiently strong relevance
characteristics to lead to an improved retrieval performance over and above the
standard thesaurus method.

Effect of Citations on Word Stem Matching Process
(200 documents, 42 queries; source documents)

Figure 5

The following principal results emerge from the present citation test:

a) the general usefulness of bibliographic citations for document
content analysis, previously noted by a number of other
investigators, is entirely confirmed;

b) bibliographic citations used for document content identification
provide a retrieval effectiveness fully comparable to
that obtainable by standard subject indicators at the low
recall-high precision end of the performance range;

c) the augmented document vectors, consisting of standard concepts
plus bibliographic citation identifiers, appear to provide a
considerably better retrieval performance than the standard
vectors made up of normal subject indicators only;

d) the bibliographic citations attached to information requests
should be taken from documents whose strong relevance
characteristics to the respective queries are known in advance by the user
population.

Use of Citations from Random Relevant Documents
(200 documents, 42 queries)

Figure 6

The present experiment then leads to the conclusion that documents
processed in a retrieval system should normally carry bibliographic citation
codes in addition to standard content indicators. When queries are received
from the user population, improved service can be obtained by using
document citations as part of the query formulations whenever documents with
a priori relevance characteristics are identified by the users at the time of
query submission. If no documents with strong relevance characteristics are
available when the query is first received, bibliographic citations can still
be used as a feedback device by updating the query formulations with citations
from previously retrieved relevant documents.
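The feedback use of citations mentioned above can be sketched as a simple query update. The additive update with unit weight, the tagging of citation entries as ('y', code), and all names are assumptions made for illustration; the study itself does not prescribe a particular update rule.

```python
def add_citation_feedback(query, relevant_doc_ids, doc_citations, weight=1.0):
    """Return a new query vector in which the citation codes of the
    documents judged relevant have been added to the citation portion."""
    updated = dict(query)
    for doc_id in relevant_doc_ids:
        for code in doc_citations.get(doc_id, ()):
            key = ("y", code)
            updated[key] = updated.get(key, 0.0) + weight
    return updated

# Hypothetical data: one standard concept in the query, two retrieved
# documents judged relevant by the user, with placeholder citation codes.
query = {("x", "flow"): 1.0}
doc_citations = {"d1": ["A", "B"], "d2": ["B"]}
updated_query = add_citation_feedback(query, ["d1", "d2"], doc_citations)
```

Under this additive rule a citation shared by several relevant documents receives a proportionally larger weight, which is one plausible way of emphasizing it in the next search.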







References 



[1] J. H. Westbrook, Identifying Significant Research, Science,
Vol. 132, No. 3435, October 28, 1960, p. 1229-1234.

[2] J. Margolis, Citation Indexing and Evaluation of Scientific
Papers, Science, Vol. 155, No. 3767, March 10, 1967, p. 1213-1219.

[3] D. J. de Solla Price, Networks of Scientific Papers, Science,
Vol. 149, No. 3683, July 30, 1965, p. 510-515.

[4] E. Garfield, Citation Indexes for Science, Science, Vol. 122,
No. 3159, July 15, 1955, p. 108-111.

[5] M. M. Kessler, Comparison of the Results of Bibliographic
Coupling and Analytic Subject Indexing, American Documentation,
Vol. 16, No. 3, July 1965, p. 223-233.

[6] G. Salton, Associative Document Retrieval Techniques using
Bibliographic Information, Journal of the ACM, Vol. 10, No. 4,
October 1963, p. 440-457.

[7] J. W. McNeill and C. S. Wetherell, Bibliographic Data as an
Aid to Document Retrieval, Scientific Report No. ISR-16 to
the National Science Foundation, Section VIII, Cornell University,
September 1969.

[8] M. Amreich, G. Grissom, D. Michelson, E. Ide, An Experiment in
the Use of Bibliographic Data as a Source of Relevance Feedback
in Information Retrieval, Report No. ISR-12 to the National
Science Foundation, Section XI, Cornell University, June 1967.

[9] G. Salton, Automatic Information Organization and Retrieval,
McGraw-Hill Book Company, New York, 1968.

[10] R. G. Crawford and H. Z. Melzer, The Use of Relevant Documents
instead of Queries in Relevance Feedback, Scientific Report No.
ISR-14 to the National Science Foundation, Section XIII, Cornell
University, October 1968.

[11] C. W. Cleverdon and E. M. Keen, Factors Determining the Performance
of Indexing Systems, Vol. 2, Test Results, Aslib Cranfield
Research Project, Cranfield, 1966.

[12] G. Salton, The Generality Effect and the Retrieval Evaluation
for Large Collections, Scientific Report No. ISR-18 to the
National Science Foundation and to the National Library of
Medicine, Section II, Cornell University, October 1970.

[13] M. E. Lesk and G. Salton, Relevance Assessments and Retrieval
System Evaluation, Information Storage and Retrieval, Vol. 4,
No. 4, October 1968, p. 343-359.






Appendix 

Sample Citation Codes 



I. Sinnott, Colin S., "On the Prediction of Mixed Subsonic/Supersonic Pressure
Distributions," Journal of Aerospace Sciences, Vol. 27, p. 767, 1960.



SINOJAS27076760 



II. Herriot, John G., Blockage Corrections for 3-Dimensional Flow Closed-Throat
Wind Tunnels, with Consideration of the Effect of Compressibility, NACA,
Rep. 995, 1950.



HERNACAORO99550



III. Cheng, H. K., "Hypersonic Shock-Layer Theory of the Stagnation at Low
Reynolds Number," Proceedings of the 1961 Heat Transfer and Fluid
Mechanics Institute, Stanford University Press, Stanford, California, 1961.



CHEPHTF00HYPE61



IV. Couper, J.E., The Operation and Maintenance of Recorder Type IT 3-16-61 , 
Unpublished M.O.A. Report. 



COUUNPU0PERAT**



V. Goldstein, S., Ed., Modern Developments in Fluid Dynamics, Vol. 1,
p. 135; Oxford, The Clarendon Press, 1936.



GS9I.K0 DE RUDE 0 i 3 8 






IV. Automatic Resolution of Ambiguities from Natural Language Text 

S. F. Weiss 



Abstract

This study investigates automatic disambiguation by template analysis. 
The evolutionary process by which ambiguities are created is discussed. 

This leads to a classification of ambiguities into three classes: true, 
contextual, and syntactic. The class assigned to a given word is 
dependent on the syntactic and semantic functions performed by the word. 

Only true ambiguities are suitable for automatic resolution. 

In this study, automatic disambiguation is accomplished by an 
extended version of template analysis. The process consists in locating 
an ambiguous word and in testing its environment against a predetermined 
set of rules for occurrences of words and structures which indicate the 
intended interpretation. Experiments using this process show that a high 
degree of accuracy in resolution can be achieved. 

The process under consideration is not completely automatic 
because it requires that a set of disambiguation rules be created a priori . 

The creation of this rule set, however, is sufficiently straightforward
that it may eventually be done automatically. A learning program is imple- 
mented to accomplish this. The process reads input words and attempts to 
resolve any existing ambiguities. If a resolution of the ambiguity is 
performed incorrectly, the rule set is augmented and modified appropriately, 
and the next input is considered. 

The experimental results obtained are poor for the first few inputs. 
The performance steadily improves as more inputs are processed, and finally
levels off at above 90% accuracy. A true learning process is thus indicated.

The proposed learning process is not only useful for disambiguation, 
but can also serve for a number of other applications, where it may be desired 
to tailor a process to a particular user need. 



1. Introduction 

An ambiguous word is defined as a word which can have two or more 
different meanings. There exist a great many such ambiguous words and their 
occurrence in text is fairly common. In general they create no problem for a 
human reader because he is constantly aware of the context of the material 
he is reading and of the real world. This usually makes obvious the proper 
definition of an ambiguous word. For example, the word BOARD may mean, among 
other things, a piece of wood or a group of people. In the first of the two 
sample sentences below, the ambiguity is resolved by the context of the sen- 
tence while in the second, resolution is achieved by the reader's knowledge 
of the real world. In other words the reader knows from his general knowledge 
that it is much more likely to cut a piece of wood than a group of people, 
even though it is technically possible to do both. 



A: He is a member of the board of directors.
B: He cut up the board.
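The two sample sentences can serve to illustrate rule-based resolution of the kind investigated in this study. The sketch below is a much simplified stand-in, not the template analysis process itself: it only tests whether hypothetical cue words occur anywhere in the sentence, and the rule set, the cue words, and the function name are assumptions made for illustration.

```python
# Hypothetical disambiguation rules: for each ambiguous word, an ordered
# list of (cue words, interpretation) pairs.
RULES = {
    "board": [
        ({"directors", "member", "committee"}, "group of people"),
        ({"cut", "saw", "nail"}, "piece of wood"),
    ],
}

def disambiguate(sentence, rules=RULES):
    """Return {ambiguous word: interpretation or None} for a sentence,
    choosing the first rule whose cue words occur in the context."""
    words = [w.strip(".,;:").lower() for w in sentence.split()]
    result = {}
    for word in words:
        if word in rules:
            context = set(words) - {word}
            result[word] = next(
                (sense for cues, sense in rules[word] if cues & context),
                None,
            )
    return result

sense_a = disambiguate("He is a member of the board of directors.")
sense_b = disambiguate("He cut up the board.")
```

A word whose environment matches no rule is left unresolved (None), which is the case in which a rule set would have to be extended.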




Disambiguation by computer is considerably more difficult. A computer does
not automatically conceptualize the context of the text as it is read. Also
a computer cannot be expected to contain the vast store of knowledge that a
human reader possesses. This study presents some techniques for automatic
semantic disambiguation of words from natural language text and the application of
template analysis to this process. A complete discussion of template analysis
is presented in Weiss [16].

The justification for such a study is that ambiguities in text are 
detrimental to any natural language process which uses that text. The 
extent of the damage imposed by ambiguities varies with the natural language 
process as is shown by the three examples below. 

1. In a SMART-like information retrieval system ambiguous words 
are assigned multiple concepts to represent their various 
possible definitions. Since only one of the definitions
is actually correct, this process adds erroneous material to
the document and query vectors. But this is not a serious 
problem since ambiguous concepts are rare and thus make up 
only a small part of a document or query vector. Resolution 
of ambiguities makes a very small change in a concept vector 
and hence causes only a very small change in document-query
correlations. Thus in a retrieval environment, ambiguities
may not pose a very serious problem and are hardly worth
resolving. Examples 2 and 3 present environments in which
the consequences of ambiguities are more serious and dis- 
ambiguation is more justified. 

2. A serious problem in automatic syntactic analysis is that an 
analyzer may produce many analyses for a single input. It 

is very difficult if not impossible to determine the intended 
analysis from among this set. Thus syntactic analysis schemes 
which generate as few analyses as possible are clearly the most 
desirable. One cause of multiple analyses is words which have 
more than one syntactic role. For example, the word FLYING 
can be either a verb or an adjective. This in turn gives 
rise to several analyses of 

THEY ARE FLYING PLANES. 

Some systems perform semantic tests to determine which of 
the syntactic analyses is semantically feasible. An even better 
approach is to resolve ambiguities prior to syntactic analysis,
thus reducing the number of analyses produced. It sometimes happens
that syntactically ambiguous words are also semantically ambiguous. 
NEGATIVE for example is usually an adjective when it means NOT and 
a noun in the photographic context. Thus by resolving the semantic 
ambiguity, the syntactic ambiguity is also removed. In this way 
resolution of semantic ambiguity can reduce the number of analyses
resultant from an automatic syntactic analysis scheme and hence 
simplify the task of determining the correct analysis. 

3. In natural language command analysis or a natural language programming
language, each statement must be mapped into a unique command or
command sequence. Statements which, due to ambiguities, simultaneously
specify more than one command sequence are unexecutable. Current
programming languages such as FORTRAN and ALGOL deal with this
problem simply by prohibiting all but the most trivially resolvable
ambiguities (such as the minus sign, which may be unary or binary).
This is not possible in natural language command analysis and thus
all ambiguities must be resolved before execution is possible.

These three examples show how the problems caused by ambiguities in
natural language text vary according to the application. In the third example
resolution is a necessity, while it is more or less a convenience in the other
two. In general it appears that at best, ambiguities do no harm and at worst
they are disastrous. In no case do they ever seem to have constructive effects.
Of course there are other examples of consequences of ambiguities, but these
three seem sufficient to justify further investigation into the area of
automatic disambiguation.

2. The Nature of Ambiguities 

Most words in isolation do not have a well defined meaning. The exact
meaning of a word is formed by the interaction of the word and its context.
A word is both acted upon by its context and acts upon its context. The
action that a word performs on its context is called its semantic function.
This can be thought of as a mathematical function with the word's context
as its argument and the total meaning as its value. An example is presented
in Figure 1 below.



Phrase: Bottom of the bottle 

Word: Bottom 

Semantic function: indicates lowest point in context 

Context: "of the bottle"

Application of semantic function to context yields the 
value: lowest point in the bottle 

Example of Semantic Function 
Figure 1 



Building on the concept of semantic function it is now possible 
to define three types of ambiguities. A word is a true ambiguity if it 
has two or more distinct semantic functions. An example is the word 
DEGREE. This may refer to a unit of temperature or angle as in "a 90 
degree turn" or an award from a school as in "college degree". These are 
clearly two separate semantic functions. Some words have only one 
semantic function yet still appear ambiguous. This situation is produced 
when a single semantic function, acting on a variety of contexts, produces 
vastly different meanings. Such words are termed contextually ambiguous.
As an example, the word CORE is considered ambiguous in the ADI dictionary.
It refers to both a computer memory and the central part of something.
However, there is only one semantic function at work here: it designates
the central aspect of its context. A computer memory is at least
conceptually if not physically the center of a computer. Thus CORE is a
contextual ambiguity according to the definition above.

A third type of ambiguity is syntactic ambiguity. The meaning of such
a word is dependent upon its syntactic role. The meaning of ELABORATE, for 
example, differs somewhat depending on whether it is used as a verb or adjective. 
These differences in meaning, however, are generally just slight variations of 
a single semantic concept. 

The classification of an ambiguous word into one of these categories
is not a strictly defined process. The categories are not completely disjoint,
and the ambiguous words themselves are in a constant state of evolutionary
change much like biological evolution. A good example of the development of
an ambiguous word can be seen in the word BOARD, which can mean a piece of
wood, a group of people (board of directors), or food (room and board).
Originally board referred only to a piece of wood or a table. Because of
their close relation to the table, the people who met there and the food
served on it became associated with the board. In time this connection
disappeared and BOARD currently appears to have three separate meanings. In
general, ambiguities seem to stem from idioms and associations due to
similarities such as between the food and the table on which it is served.
These words gradually evolve into contextual and finally true ambiguities.

Many of the words currently considered contextually ambiguous may eventually
become true ambiguities. For example, it is conceivable that in the future,
computer memories may no longer be considered a central element of the machine.
Thus CORE, shown previously to be a contextual ambiguity, may become a true
ambiguity. As another example, consider the word LUNACY. It was originally
thought that this form of insanity was caused by the moon, and hence the name.
However, the lunar influence is now better understood, and there is no
connection between the disease and the moon. Thus the common stem
LUNA represents an evolved ambiguity.

Before considering resolution of ambiguities, it is necessary
to decide which type or types can and should be resolved. There are
several criteria for this decision. First, does the resolution of the
ambiguity add any additional information to that already known? Second,
does the added information warrant the work involved to determine it?
And finally, what harmful effects might be expected if the ambiguity were
not resolved?



As shown above, the meanings of the various forms of a syntactic
ambiguity vary only slightly. Thus very little information is added if
resolution is performed. Also, harmful effects caused by syntactic ambiguities
are slight and occur only in special cases, as is shown in the following
example. Let A, B, and C be words with A syntactically ambiguous and
having meanings in thesaurus classes 1 and 2 (see Figure 2). B and C are
not ambiguous. B is in thesaurus category 1 and C is in 2. Leaving A
unresolved, that is using only a single concept to represent A, would in
effect combine categories 1 and 2. This would make B appear synonymous to
C, which is not really the case. However, as shown previously, the
differences in meaning of the various forms of syntactic ambiguities are
slight, so categories 1 and 2 must be very close in meaning.
Thus combining B and C is not a particularly grave error. For this reason
it appears unwarranted to resolve syntactic ambiguities.



WORDS     THES. CATEGORIES

A         1,2 (SYN AMB)
B         1
C         2

Sample Syntactic Ambiguity
Figure 2

As discussed previously, contextual ambiguities have only one semantic
function. The differences in meaning are caused by the context rather than
by the word itself. It is therefore questionable whether such words should be
disambiguated at all. Also because contextual ambiguities derive much of their
meaning from context, they may have a broad spectrum of meanings rather than
the few discrete meanings possessed by most true ambiguities. Intuitively at
least this seems to indicate that the resolution of contextual ambiguities is
both more difficult and less precise than resolution of true ambiguities.
Experiments in this area show this to be the case.

The remaining class, the true ambiguities, demonstrates the properties 
necessary to justify their resolution. The remainder of this study deals with 
techniques for automatic resolution of true ambiguities. 



3. Approaches to Disambiguation 

Many automatic natural language analysis systems have a facility for
automatic disambiguation. For some this entails the use of semantic information
to resolve syntactic ambiguities and hence reduce the number of syntactic
parses. Other systems actually tackle the problem of true semantic ambiguities.
This section discusses some of these approaches to automatic disambiguation.

The easiest solution to the problem is simply to ignore it. This 
approach is actually not as absurd as it initially appears. When the domain 
of discourse is sufficiently limited, many ambiguities disappear. This is the 
case with the information retrieval system implemented by Dimsdale and Lamson 
[3]. By limiting the subject area to the medical field, the problem of ambiguities 
solves itself. For example, the word CELL has a number of possible meanings 
(dry cell, jail cell, muscle cell). However, only one of these interpretations 
is appropriate to medicine; and thus in this context, CELL may be treated as
unambiguous.

As mentioned previously, one possible application for automatic
disambiguation is in indexing documents for information retrieval. There
are a number of possible techniques. Some researchers, for example Ranganathan
[10] and Mandersloot et al. [4], suggest that ambiguous words be represented
by a number of concepts which resolve the ambiguity. One of these additional
concepts could be the hierarchical father of the word under consideration.
For example, the ambiguity caused by the word TYPE could be resolved by
adding the concept for PRINTING. SMART uses a different method. An
ambiguous word is assigned the concepts of all its possible interpretations.
The set of concepts then shares the total weight. Thus SMART covers all
possibilities and is guaranteed to have the correct concept; however, it
is also guaranteed to have some wrong concepts. This inclusion of
error would appear to weaken the indexing scheme and hence damage retrieval,
but this is not the case. The occurrence of ambiguous words is quite rare
and hence the error introduced by the process represents only a very small
part of a total concept vector. Thus the effect on results is very small.
In addition, problems can only be caused when a thesaurus is used that contains
words which are synonymous to some but not all of the interpretations of an
ambiguous word. Actual experiments reveal that the resolution of
ambiguities in SMART concept vectors results in improvement of less than
1%. Thus the added effort required to resolve ambiguities in this type
of information retrieval context seems unwarranted.
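
The SMART treatment of ambiguous words described above can be sketched as follows. This is a minimal illustration and not the SMART implementation: the dictionary entries, the concept numbers, and the equal split of a unit weight among interpretations are all assumptions made for the example (the text states only that the concepts share the total weight).

```python
from collections import defaultdict

# Hypothetical dictionary: word -> concept numbers of all its interpretations.
DICTIONARY = {
    "type": [101, 102],   # ambiguous: two interpretations (assumed numbers)
    "printing": [101],
    "retrieval": [200],
}

def concept_vector(words, dictionary=DICTIONARY):
    """Build a concept vector from a list of words.

    Each word contributes a total weight of 1.0; an ambiguous word
    divides that weight equally among its concepts (an assumption).
    Words missing from the dictionary are ignored.
    """
    vector = defaultdict(float)
    for word in words:
        concepts = dictionary.get(word.lower(), [])
        for concept in concepts:
            vector[concept] += 1.0 / len(concepts)
    return dict(vector)

vec = concept_vector(["type", "printing", "retrieval"])
```

Here the wrong interpretation of "type" (concept 102) enters the vector, but only with half of one word's weight, which illustrates why the error remains a small part of the whole vector.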

Some question-answering systems with a restricted data base are
able to disambiguate simply by testing the various interpretations against
the data base and choosing the one that is applicable. DEACON is an
example of one such system [15]. A query such as the one below is ambiguous:
Guam is an island and an aircraft carrier. But since DEACON's data
base deals with ships, the latter interpretation is chosen.

How many people are on Guam?

Other systems perform a similar type of disambiguation by using lists of
true predicates. Coles' system, for example, tests the query against a set
of truth values. Similarly the process used by Schank and Tesler tests various
ambiguous interpretations for consistency with a set of real world attributes.

Another basic method for automatic disambiguation is to associate
semantic features with each word in the lexicon. Rules, similar to syntactic
rules, can then test various possible interpretations for semantic as well as
syntactic well-formedness. One such system is Simmons' PROTOSYNTHEX [12].
Each word is associated with its semantic class. For example, "angry" is a
type of emotion and "pitcher" is a type of person (baseball player) or a type
of container. Ambiguities such as "pitcher" are resolved by testing its
syntactic structure against a set of semantic event forms. These indicate
possible valid relationships between semantic classes. The semantic event
forms reveal, for example, that a person can have an emotion while a container
cannot. Thus the disambiguation of "angry pitcher" is accomplished. Woods
accomplishes disambiguation in much the same way. Syntactic and semantic
features are attached to words; and rules indicate legitimate combinations of
these features.
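
The "angry pitcher" example can be illustrated with a small sketch. The lexicon, the class names, and the single event form below are hypothetical and chosen only to mirror the example in the text; they do not reproduce the actual data structures of any of the systems cited.

```python
# Hypothetical lexicon: word -> set of semantic classes it may belong to.
SEMANTIC_CLASSES = {
    "angry": {"emotion"},
    "pitcher": {"person", "container"},
}

# Hypothetical semantic event forms: allowed (modifier class, head class) pairs.
# Only a person can have an emotion; a container cannot.
EVENT_FORMS = {("emotion", "person")}

def resolve(modifier, head):
    """Return the semantic classes of `head` that are compatible with `modifier`."""
    return sorted(
        head_class
        for head_class in SEMANTIC_CLASSES[head]
        for mod_class in SEMANTIC_CLASSES[modifier]
        if (mod_class, head_class) in EVENT_FORMS
    )

readings = resolve("angry", "pitcher")
```

Only the "person" reading survives the test, so the container interpretation of "pitcher" is discarded.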

Lesk uses a similar approach in his proposed natural language analysis
system, but with a unique statistical feature [7]. In his system words are
assigned both syntactic and semantic role indicators by the dictionary. The
parse then determines syntactic dependencies and tests them for semantic
validity. Those interpretations which fail the semantic test are eliminated,
thus accomplishing some disambiguation. In addition, each interpretation
of each ambiguous word has associated with it the probability of the
"correctness" of that interpretation. For example, in a sports text
the word "base" would be much more likely to refer to a baseball base
than to a military base; and probabilities may be assigned accordingly.
During the syntactic analysis a number of possible parses are developed.
The probability of correctness for each is the product of its constituent
probabilities. In this way, interpretations with very low probabilities
of being valid may be eliminated, thus accomplishing another form of
disambiguation.
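
The statistical step can be sketched as below. The probabilities, the parse labels, and the cutoff value are invented for illustration; the text specifies only that a parse's probability is the product of its constituent probabilities and that very unlikely parses may be eliminated.

```python
from math import prod

def parse_probability(constituent_probabilities):
    """Probability of a parse: the product of its constituent probabilities."""
    return prod(constituent_probabilities)

def prune(parses, threshold):
    """Keep only parses whose probability is at least `threshold`.

    `parses` maps a parse label to the list of probabilities of the
    interpretations it uses. The threshold is an assumed parameter.
    """
    scored = {label: parse_probability(p) for label, p in parses.items()}
    return {label: s for label, s in scored.items() if s >= threshold}

# Hypothetical probabilities for two readings of a sentence in a sports text.
parses = {
    "baseball reading": [0.9, 0.8],
    "military reading": [0.1, 0.8],
}
kept = prune(parses, threshold=0.25)
```

With these assumed numbers only the baseball reading remains after pruning.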

The processes presented above use syntactic and semantic features
to qualify the words and then employ a common rule list to govern word
combination. A more detailed approach to disambiguation is to attach
specific combination rules to each word. The need for this can be seen in
the following simple example. Most noun phrases consisting of an adjective
and a noun assume the basic features of the noun. The phrase may then be
used anywhere that the noun is legal. For example, the phrase "folding
money" may be used wherever "money" can be used. This is not true for
"folding", which in some sense loses its identity when combined with the noun.
Most of the systems which use a combination rule list can determine this
property. There are, however, exceptions to this rule. Consider the phrase
"Tompkins County". Here the word "Tompkins", acting as an adjective,
dominates the phrase. It is all right to say "Buffalo is in a county" but
"Buffalo is in Tompkins County" is semantically and geographically incorrect.
Thus in this case the phrase assumes the properties of the adjective. To
treat properly this and other similar cases, it is useful to associate
combination rules with individual words rather than using a common rule list
for all words. Some of the automatic systems which employ this approach are
those by Kellogg [6] and Quillian [9].

Kellogg's scheme assigns a set of data structures to each interpretation
of each word. These include semantic features and selection restrictions.
For a particular word the selection restrictions limit the words with which
it can be associated to only those with specific semantic features. For
example, the verb "talk" can take only an animate subject.

In Quillian's Teachable Language Comprehender, memory is represented
as an interconnected network of nodes. The meaning of a phrase is determined by
locating a path in the network from one constituent word to the other. For
some phrases there is more than one legal path. This indicates an ambiguous
phrase. Disambiguation is achieved by using the shortest path. This represents
the most likely interpretation and is thus similar in approach to Lesk's
statistical scheme.
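
The shortest-path idea can be illustrated with a breadth-first search over a toy network. The nodes and links below are entirely hypothetical and are not taken from the Teachable Language Comprehender; the sketch shows only how choosing the shortest path selects one interpretation when several paths exist.

```python
from collections import deque

# Hypothetical memory network as an undirected adjacency list.
NETWORK = {
    "pitcher": ["player", "container"],
    "player": ["pitcher", "person"],
    "person": ["player", "angry"],
    "container": ["pitcher", "object"],
    "object": ["container", "thing"],
    "thing": ["object", "state"],
    "state": ["thing", "angry"],
    "angry": ["person", "state"],
}

def shortest_path(start, goal, network=NETWORK):
    """Breadth-first search; returns the first (hence shortest) path found, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in network.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

path = shortest_path("pitcher", "angry")
```

In this toy network two paths connect "pitcher" and "angry"; the shorter one runs through "player" and "person", so the baseball-player reading is selected.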

The processes discussed so far deal with disambiguation as a tool in
some sort of information retrieval or question-answering facility. Moyne [8]
summarizes this type of disambiguation as falling into one of four interaction
types: interaction with the lexicon, with the data base, with the general
system capabilities, and if all else fails, interaction with the user. This
last type is strictly a last resort measure but is very helpful when unresolvable
ambiguities are encountered.

As shown above, much of the work in disambiguation deals with larger
information retrieval and question-answering systems. But some work has also
been done on ambiguities alone. In particular, there is the work by Stone [14],
Coyaud [2], and Borillo and Virbel [1]. All these schemes are based on resolution
of ambiguities by examination of semantic context. Associated with each word is
a set of words and concepts which, if found near the ambiguous word, specify
a particular interpretation. Stone concentrates on the resolution of
ambiguities in high frequency words such as "matter". The study by Borillo
and Virbel represents the most detailed and complete discussion of
disambiguation encountered in the literature. They discuss all forms of
ambiguities, and present for each the methods needed for resolution. Ambiguous
words are divided into five classes:



1. key word
2. grammatical ambiguity
3. semantic ambiguity
4. combined semantic and grammatical
5. forced



The key words are words of variable importance whose resolution is not vital.
The forced words are so important that all interpretations must be
represented. The remainder are self explanatory. The third and fourth classes
are most interesting and correspond roughly to the true ambiguities presented
in the previous section. Resolution is achieved by examining some environment
of the ambiguous word for certain structural or semantic clues. In addition,
Borillo and Virbel give a suggested list of attributes for a disambiguation
process. These are first, that the context of an ambiguous word should be
scanned in closest to farthest order. Second, resolution rules should be
weighted according to their probability of correctness. And third, the scope
of the context should be variable from word to word.

Building on this introduction, the next sections present an automatic
disambiguation scheme using the template analysis process. It is designed
as a disambiguation package for a natural language conversational system
and hence expected input is clearly restricted. In addition, each ambiguous
word is treated separately and the relevant context of each word is quite

4. Automatic Disambiguation

A) Application of Extended Template Analysis to Disambiguation. 

Associated with each ambiguous word is a set of keywords or structures
which identify the intended meaning. For example, if within the context of
the word BOARD there are references to "fir", "pine", or "oak", a wooden
board is probably intended. If "chairman" or "meeting" occurs, board would
be taken to mean a group of people. This key to the intended meaning of
an ambiguous word is usually found in the immediate context of that word,
often in the same sentence. The actual optimal scope of context varies from
word to word. Borillo and Virbel indicate that in general, best results are
obtained using large sentence groups (document abstracts). In some cases,
however, this is too broad and permits erroneous resolution by matching the
wrong key. For this reason the scope of context is defined here to be the
sentence containing the ambiguous word. Each sentence containing an ambiguous
word is scanned for a resolution key. This resolution key may be a word,
group of words, or structure, which reveals the intended meaning. The process
is implemented using an extended version of template analysis [16]. This
section discusses the extensions to template analysis that are required to
facilitate automatic disambiguation. The disambiguation process is presented
in subsection B and the experimental results in subsection C.

A template is basically a string of words. It matches a natural
language input only if a substring of the input matches the template elements
exactly, including ordering and contiguity. Many ambiguities may be resolved
using templates; but for others, templates are too strict a criterion. For
these words the presence of a resolution key anywhere in the input is sufficient
to warrant resolution. For this reason the context rule is used. Like a
template, the context rule is a string of words. However, a context rule is
considered to match an input if the input contains all the words of the rule
with no restriction on ordering or contiguity. In Figure 3 below, the template
matches only input A while the context rule matches A, B, and C. Thus a
context rule represents a purely semantic test while a template requires
both semantics and syntax (structure).
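
The difference between the two kinds of rule can be sketched in a few lines, using the inputs of Figure 3. The function names and the crude plural stripping (so that "computers" matches COMPUTER) are assumptions for this illustration; the original system presumably normalized words through its thesaurus.

```python
import re

def normalize(text):
    """Lowercase, split into words, and strip a trailing 's' (crude assumption)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if w.endswith("s") else w for w in words]

def template_matches(template, text):
    """True if the template words occur in the input in order and contiguously."""
    t, s = normalize(template), normalize(text)
    return any(s[i:i + len(t)] == t for i in range(len(s) - len(t) + 1))

def context_rule_matches(rule, text):
    """True if every word of the rule occurs somewhere in the input, in any order."""
    return set(normalize(rule)) <= set(normalize(text))

inputs = {
    "A": "Elements of computer programming",
    "B": "Programming of digital computers",
    "C": "Computer design and programming",
}
template_hits = [k for k, v in inputs.items()
                 if template_matches("COMPUTER PROGRAMMING", v)]
context_hits = [k for k, v in inputs.items()
                if context_rule_matches("COMPUTER PROGRAMMING", v)]
```

The template matches input A only, while the context rule matches A, B, and C, in agreement with Figure 3.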

The process used for matching the input against both templates and 
context rules is a middle-outward search strategy. That is, the search begins 
at the ambiguous word and extends outward in both directions. This guarantees 
finding the resolution key which lies closest to the ambiguous word. This is
necessary for two reasons. First, if an input contains two or more
occurrences of a particular ambiguous word, each must be paired with its
closest resolution key in order to obtain correct results. The examples
in Figure 4, though admittedly rather contrived, demonstrate the need for 
this technique. 
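
A minimal sketch of the middle-outward search follows, applied to the two inputs of Figure 4. The table of resolution keys and their meanings is hypothetical, and checking the left neighbor before the right one at equal distance is an arbitrary choice made for the sketch.

```python
import re

# Hypothetical resolution keys for DEGREE: key word -> intended meaning.
KEYS = {
    "college": "academic award",
    "cold": "unit of temperature",
    "large": "extent",
}

def nearest_key(tokens, position, keys=KEYS):
    """Search outward from `position`, one word at a time in both directions."""
    for distance in range(1, len(tokens)):
        for index in (position - distance, position + distance):
            if 0 <= index < len(tokens) and tokens[index] in keys:
                return tokens[index], keys[tokens[index]]
    return None  # no resolution key in the sentence

def disambiguate(sentence, ambiguous="degree"):
    """Resolve every occurrence of the ambiguous word with its nearest key."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [nearest_key(tokens, i) for i, t in enumerate(tokens) if t == ambiguous]

result_a = disambiguate("It was very cold when he received his college degree.")
result_b = disambiguate("His college degree was to a large degree, well earned.")
```

For input A the search reaches COLLEGE before the temperature reference, and for input B each occurrence of DEGREE is paired with its own nearest key.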

B) The Disambiguation Process 

The process of disambiguation requires the following elements:
a small thesaurus of words needed in the disambiguation process, a set of
templates and a set of context rules. The process first reads the input
and each word is looked up in the disambiguation thesaurus. Most words
are not found and are classified as unknown. The input is first matched
against the template set and then the context rule set using the middle-out
search strategy. Disambiguation is performed by the first rule successfully
matched. The rules in each set are ordered so that the strongest rules,
that is the ones that are expected to provide the best disambiguation
performance, appear at the top. The weaker or last resort rules appear
near the bottom. Scanning the rule list top to bottom matches strong rules
first and ensures that an input is matched with the rule that has the greatest
chance of providing a correct analysis. Associated with each rule is the
meaning appropriate to that resolution key. If no match is found between
an input and any rule, the ambiguity is considered unresolved. An option
may be used in connection with such unresolved inputs. For some ambiguous
words one interpretation is much more likely than all the rest. For these
a significant saving in the size of the rule sets and in the work involved
can be obtained by testing for all but the most likely interpretation. If
no matches occur the result is taken by default to be the most likely meaning.
This option is used for some of the experiments that follow.

Template:       COMPUTER PROGRAMMING
Context rule:   COMPUTER PROGRAMMING

Inputs:   A.  Elements of computer programming
          B.  Programming of digital computers
          C.  Computer design and programming

Template matches A only
Context rule matches A, B, and C

Comparison of Templates and Context Rules
Figure 3

Input A:   It was very cold when he received his college degree.
Action:    COLLEGE rather than the temperature reference
           must be used to disambiguate DEGREE.

Input B:   His college degree was to a large degree, well earned.
Action:    Each DEGREE must be associated with its nearest
           resolution key.

Search Strategy (Underline Indicates Resolution Key)
Figure 4
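
The ordered rule lists and the default option can be sketched as follows. The rules for BOARD shown here are hypothetical, and the middle-outward search is omitted for brevity; the sketch shows only that templates are tried before context rules, that the first matching rule decides, and that an optional default replaces an unresolved result.

```python
import re

def words(text):
    return re.findall(r"[a-z]+", text.lower())

def template_match(template, tokens):
    """Template words must appear in order and contiguously."""
    n = len(template)
    return any(tuple(tokens[i:i + n]) == template
               for i in range(len(tokens) - n + 1))

def context_match(rule, tokens):
    """All rule words must appear somewhere in the input."""
    return set(rule) <= set(tokens)

# Hypothetical rule sets for BOARD, each ordered strongest rule first.
TEMPLATES = [
    (("board", "of", "directors"), "group of people"),
    (("room", "and", "board"), "food"),
]
CONTEXT_RULES = [
    (("chairman",), "group of people"),
    (("meeting",), "group of people"),
    (("oak",), "piece of wood"),
    (("pine",), "piece of wood"),
]

def resolve(sentence, default=None):
    """Return the meaning of the first matching rule, else `default`.

    A result of None means the ambiguity is left unresolved.
    """
    tokens = words(sentence)
    for template, meaning in TEMPLATES:
        if template_match(template, tokens):
            return meaning
    for rule, meaning in CONTEXT_RULES:
        if context_match(rule, tokens):
            return meaning
    return default

r1 = resolve("The board of directors met in an oak room.")
r2 = resolve("He sawed the pine board in half.")
r3 = resolve("Everyone went on board.")
r4 = resolve("Everyone went on board.", default="piece of wood")
```

In the first sentence the template outranks the weaker "oak" context rule; the third sentence matches no rule and stays unresolved unless a default is supplied.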

C) Experiments 

After classifying the ambiguous words found in the ADI* dictionary
as true, contextual or syntactic, five true ambiguities are chosen for
experimentation. The words are:

DEGREE 

TYPE 

VOLUME 

BOARD 

CHARGE 



For each word except DEGREE a corpus of 50 sentences is used. A larger corpus
is used for DEGREE to provide a more exhaustive test. Each corpus contains
all sentences from the ADI documents which contain the ambiguous word as
well as other sentences written by the author and other informants. Each
corpus is divided into two sets: S-1, called the creation set, and S-2,
called the test set. S-1 contains 20 sentences; S-2 contains the remainder
of the corpus. The experimental procedure used for each word is as follows.
First, using S-1 only, a thesaurus, template set and context rule set are
created by hand. The disambiguation program is then run on S-1. Appropriate
additions and modifications are made to the thesaurus and rule sets, and the
program is tried again. This continues until the process provides a high
degree of success in resolving ambiguities from S-1. The thesaurus and rule
sets existing at this point are thus effectively tuned to the creation set
S-1. Next, and without further modification of the thesaurus or rules, the
disambiguation process is run using S-2 as input. The process is thus tested
on an input set it has never seen before, and one to which it is not
specifically tuned. The result parameters used are shown in Figure 5 below.
Resolution recall indicates what proportion of the total number of ambiguities
in the input set are correctly resolved, while resolution precision indicates
what proportion of the analyses performed by the system are correct. In order
to perform satisfactorily, the process must give reasonably high values for
both RR and RP. In the optimal case both values are 1. The results obtained
for the five S-2 sets appear in Figure 6 along with totals for all five words.
The default option is used in the analysis of TYPE and CHARGE. Inputs for which
the system does not perform an analysis for these words are taken to be of a
particular interpretation. Thus no inputs are considered unanalyzed
(indicated in Figure 6 by an asterisk in the U column).

* The ADI Collection is a set of short papers on automation and scientific
communication published by the American Documentation Institute, 1963.

T     The total number of ambiguities in the input set. (This number
      is sometimes larger than the number of sentences in the input
      set because a few of the sentences contain multiple occurrences
      of the ambiguous word.)
C     The number of ambiguities correctly resolved.
I     The number of ambiguities incorrectly resolved.
U     The number of ambiguities not resolved in any way.
RR    Resolution recall = C/T
RP    Resolution precision = C/(C+I)

Result Parameters
Figure 5

WORD       T     C     I     U     RR     RP

DEGREE     92    84    4     4     .92    .93
TYPE       30    29    1     *     .97    .97
VOLUME     30    27    1     2     .90    .96
BOARD      30    22    0     8     .73    1.00
CHARGE     32    30    2     *     .94    .94
TOTAL      214   192   8     14    .90    .96

* Indicates default used

Results of Disambiguation of S-2 Sets
Figure 6

These results indicate that extended template analysis is a useful and
accurate technique for resolution of true ambiguities. The errors which do
occur are not, in general, generated by inputs with normal constructions.
Rather they are due mostly to idiomatic expressions which are not included
in the creation set. As an example the expression ON BOARD is not in S-1
for BOARD. This in turn leads to a number of inputs in the test set being
unanalyzed. While such idioms in natural language may prevent perfect
disambiguation quality, they occur relatively infrequently in practice and thus
reduce the system performance only slightly.
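
The result parameters of Figure 5 can be recomputed from the counts of Figure 6 with a short sketch. The counts are copied from the figure, with U taken as 0 for the two words analyzed with the default option; the function and variable names are illustrative only. Recomputed this way, the DEGREE row comes out at about .91 and .95 rather than the printed .92 and .93, which may reflect a transcription or rounding difference in the source.

```python
# Counts (T, C, I, U) per word, from Figure 6.
# U is 0 for TYPE and CHARGE, where the default option is used.
COUNTS = {
    "DEGREE": (92, 84, 4, 4),
    "TYPE": (30, 29, 1, 0),
    "VOLUME": (30, 27, 1, 2),
    "BOARD": (30, 22, 0, 8),
    "CHARGE": (32, 30, 2, 0),
}

def resolution_recall(t, c, i, u):
    """RR = C / T: share of all ambiguities that are correctly resolved."""
    return c / t

def resolution_precision(t, c, i, u):
    """RP = C / (C + I): share of performed analyses that are correct."""
    return c / (c + i)

# Column sums give the TOTAL row.
total = tuple(sum(column) for column in zip(*COUNTS.values()))
rows = dict(COUNTS, TOTAL=total)
scores = {
    word: (round(resolution_recall(*v), 2), round(resolution_precision(*v), 2))
    for word, v in rows.items()
}
```

The column sums reproduce the TOTAL row of Figure 6, and each word satisfies T = C + I + U.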

D) Further Disambiguation Processes 

A number of further processes are suggested by the experiments
performed here. First, a statistical weighting can be attached to each
resolution rule. This would represent the probability of correctness of the
given rule. The context of the ambiguous word could then be searched for
all, not just one, resolution key. For each key found, a correlation is
calculated which takes into account the probability of the rule being correct
as well as the key's distance from the ambiguous word. The rule with the
highest correlation is then used. In this way a strong resolution key can
take precedence over a weak one lying closer to the ambiguous word.

A second addition is the use of a variable context. All methods
for disambiguation presented here, including those by Borillo and Virbel and
template analysis, use a fixed context size for all words. However, the optimal
context size varies from word to word. It would thus be better to associate
with each word the context width that works best. A third possible future
technique is to use antirules. These are rules which, if matched, tell what
interpretation of the ambiguous words cannot be used. For example, if Y
appears in an input, interpretation X is prohibited even if indicators for X
are present. These extensions, however, are beyond both the scope and the
spirit of the present study.
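
The first suggested extension can be sketched as below. The text does not define the correlation, so the form used here (rule probability divided by distance in words) is purely an assumption, as are the rule probabilities and the example sentence.

```python
import re

# Hypothetical rules for DEGREE: key -> (meaning, assumed probability of correctness).
RULES = {
    "cold": ("unit of temperature", 0.2),   # weak key
    "college": ("academic award", 0.95),    # strong key
}

def correlation(probability, distance):
    """Assumed scoring: rule probability discounted by distance from the word."""
    return probability / distance

def best_meaning(sentence, ambiguous="degree", rules=RULES):
    """Score every resolution key in the sentence and return the best meaning.

    Assumes the ambiguous word occurs in the sentence; only its first
    occurrence is considered in this sketch.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    position = tokens.index(ambiguous)
    candidates = [
        (correlation(rules[token][1], abs(index - position)), rules[token][0])
        for index, token in enumerate(tokens)
        if token in rules
    ]
    return max(candidates)[1] if candidates else None

choice = best_meaning("The college granted his degree, cold as it was.")
```

Here the weak key "cold" lies next to DEGREE while the strong key "college" lies three words away, yet the strong key wins under the assumed scoring. A nearest-key search would have chosen the temperature reading instead.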

5. Learning to Disambiguate Automatically

A) Introduction 

The processes of creating and modifying the sets of templates and
context rules as presented in section 4 are relatively straightforward
and algorithmic in nature. Rules are constructed from creation set inputs
by fairly specific means. Likewise, in rule modification an erroneous rule
is removed and replaced by one or more rules which perform correctly. It
seems possible that these tasks can be handled by computer. Thus instead
of telling the program what to do by manually supplying rules, the system
would learn to disambiguate by creating and modifying its own rule sets.
The advantages of such a system over one of the type described previously
are obvious. First, it eliminates the need for a human analyst to study
sample inputs and create template and context rule sets. Second, the system
is not static. By learning from inputs and its own mistakes it is constantly
improving its performance. This process can even be used to tailor a
system to an individual user. Disambiguation rules, or rules for any
number of other processes, that are designed by or for a particular user
are not always well suited for others. By allowing the system to learn
separately from each individual, the particular needs of each user are
satisfied. This section discusses some techniques for automatically learning
to disambiguate.

B) Dictionary and Corpus

When disambiguation rules are prepared by hand, the words which are
to be used in the disambiguation are known in advance. The disambiguation
dictionary need only contain these relevant words and thus is quite small.
In the learning process, there is no prior knowledge of the words that are
to be used to facilitate disambiguation. For this reason a full dictionary
containing all the words in the input must be employed initially. This
large dictionary, however, is needed only temporarily. After the initial
instability of the learning process has settled down and relatively fixed
rule sets remain, the words in these rule sets may be used to construct a
small disambiguation dictionary which can be used thereafter.

The corpora used in this study are very special. In practice an 
operational learning system has a very large input set. The learning process 
may thus extend over hundreds or even thousands of inputs. However, such 
large data sets are neither readily available nor practical for an 
experimental system. For this reason it is necessary to develop a small 
corpus which simulates a much larger one. This is a technique used in a 
number of experimental studies including Harris’ investigation of morpheme 
boundaries [5]. 

The rules governing this stem from two fundamental maxims of education. First, 
a student or learning device cannot be expected to answer a question about 
something he has not seen previously. That is, a student’s first exposure to 
a concept must be in a learning, not a testing, environment. And second, to 
evaluate learning quality, testing is required. Basically these rules say 
that to test a learning system properly, each concept to be learned must 
occur at least twice in the input, once for learning and subsequently for 
testing. Single occurrences are undesirable because if they are considered 
as a test, they violate the first rule, while if, as the first rule stipulates, 
the single occurrence is considered for learning only, no testing can occur 
and the second rule is violated. Large data collections are likely to have 
multiple occurrences of most concepts. This, however, is not true for small 
corpora, and care must be taken to ensure such repetition. To accomplish 
this the following algorithm is used for corpus construction for each 
ambiguous word. First, a set of 20 short sentences is written, each containing 
the ambiguous word. No restriction on vocabulary or construction is 
imposed for these first 20 sentences. Next, 40 more sentences are 
written using only words found in the first 20. Again no restriction on 
construction is imposed. The resulting 60 sentences are sufficiently 
restricted in vocabulary to ensure that most words and constructs occur 
at least twice. The corpus thus simulates a corner of a much larger 
collection. To determine if the system is unlearning previously learned 
information while learning new material, the actual input consists of the 
set of 60 sentences repeated three times. Each set of 60 is randomly 
permuted to eliminate any prejudice due to ordering. The input format 
is summarized in Figure 7. Such corpora currently exist for three 
ambiguous words: VOLUME, TYPE, and DEGREE. 



These are chosen from the set used in previous experiments because VOLUME 
is rather difficult to disambiguate, TYPE is fairly easy, and DEGREE is 
between, tending toward difficulty. It is felt that the results obtained 
and the problems encountered with these words are typical of those to be 
expected for most other words. 
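
The corpus construction for one ambiguous word can be sketched as follows. This is a minimal illustration, not the original implementation: the function names, the whitespace tokenization, and the fixed random seed are assumptions. The first function builds the 180-sentence input from three independently permuted copies of the 60-sentence corpus; the second flags any of the 40 later sentences that use a word not found in the first 20.

```python
import random
from typing import List


def build_input(corpus: List[str], repetitions: int = 3, seed: int = 0) -> List[str]:
    """Concatenate `repetitions` independently shuffled copies of the corpus."""
    rng = random.Random(seed)          # seeded so that a run can be repeated
    stream: List[str] = []
    for _ in range(repetitions):
        block = list(corpus)           # copy, so the caller's list is untouched
        rng.shuffle(block)
        stream.extend(block)
    return stream


def out_of_vocabulary(first_20: List[str], later_40: List[str]) -> List[str]:
    """Return the later sentences that use a word absent from the first set."""
    vocabulary = {word for s in first_20 for word in s.lower().split()}
    return [s for s in later_40 if not set(s.lower().split()) <= vocabulary]
```

Each block of 60 in the result is a permutation of the same sentences, so every sentence is seen once for learning and at least twice more for testing.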



A: Corpus, permutation 1 
B: Corpus, permutation 2 
C: Corpus, permutation 3 

Summary of Input Format for Learning System 
Figure 7 

C) The Learning Process 

The learning process is implemented as a set of subroutines to the 
system described in section 4. Dynamic template and context rule lists 
replace the fixed sets. Initially there are no rules in these sets. The 
processing of each input sentence proceeds as follows. After the input is 
read and the ambiguity located, the system attempts to disambiguate the 
word using templates and context rules currently in the system. When the 
analysis is complete, the system looks at the correct answer. If the analysis 
is correct, the system is assumed to contain the appropriate rules for the 
recognition of the input structure and the system goes on to the next input. 

If the system is unable to resolve the ambiguity, that is, if no existing 
rule matches the input, new rules must be added. New templates and context 
rules are created using the prespecified parameters I and J. I specifies the 
size of the area around the ambiguous word from which templates are to be 
made. Similarly J indicates the size of the area from which context rules 
are to be made. In general J is larger than I since unstructured resolution 
keys can lie farther away from the ambiguous word than do structured keys. 
For this study I and J have the values of 2 and 5 respectively. A template 
is made for each word of the input sentence which lies within plus or minus 
I of the ambiguous word. The templates preserve the ordering and the 
relative distance between words. A context rule is created for each word 
within plus or minus J of the ambiguous word provided the word is not found 
on a predefined exclusion list. As indicated previously, context rules have 
no ordering or contiguity restriction. The exclusion list contains words 
which are of no value in establishing context. These include articles, some 
prepositions, forms of the verb TO BE, etc. The list is created by 
consideration of context in general and without any reference to specific 
words being disambiguated. The exclusion list is not used in the creation of 
templates because some apparently trivial words are actually important when 
found in particular structural relationships to an ambiguous word. For 
example, one of the primary templates for the disambiguation of TYPE is 

     type of 
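
The rule-creation step just described can be sketched as below, under stated assumptions. The `Template` and `ContextRule` record layouts, the function name, and the contents of the exclusion list are hypothetical; the report gives only the categories of excluded words. The values I = 2 and J = 5 are those used in the study.

```python
from typing import List, NamedTuple, Set, Tuple


class Template(NamedTuple):
    offset: int            # signed distance from the ambiguous word (order preserved)
    word: str
    interpretation: int


class ContextRule(NamedTuple):
    word: str              # no ordering or contiguity restriction
    interpretation: int


# Hypothetical exclusion list: articles, some prepositions, forms of TO BE.
EXCLUSION: Set[str] = {"a", "an", "the", "of", "in", "is", "are", "was", "be"}


def make_rules(tokens: List[str], target: str, interpretation: int,
               i: int = 2, j: int = 5,
               exclusion: Set[str] = EXCLUSION
               ) -> Tuple[List[Template], List[ContextRule]]:
    """Create candidate templates (within +/- i) and context rules (within +/- j)."""
    pos = tokens.index(target)         # assumes one occurrence of the target word
    templates: List[Template] = []
    contexts: List[ContextRule] = []
    for k, word in enumerate(tokens):
        d = k - pos
        if d == 0:
            continue
        if abs(d) <= i:                # exclusion list is NOT applied to templates
            templates.append(Template(d, word, interpretation))
        if abs(d) <= j and word not in exclusion:
            rule = ContextRule(word, interpretation)
            if rule not in contexts:
                contexts.append(rule)
    return templates, contexts


tokens = "a tiger is a large type of cat".split()
templates, contexts = make_rules(tokens, "type", interpretation=2)
```

For this sentence the template at offset +1 is the "type of" pattern mentioned above, which would have been lost had the exclusion list been applied to templates.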



The templates and context rules created by this process are first placed in a 
temporary store and checked against rules already in the permanent template 
and context rule sets. All rules in the temporary store which are not 
duplicates of existing rules are added to the bottom of the appropriate 
permanent set. This completes the action for an unanalyzed input. 

The third possible outcome is for the system to produce an erroneous 
analysis. In this case the rule sets not only lack the rules needed for 
correct analysis, but also contain an erroneous rule. Therefore when this 
situation arises, the rule which produces the incorrect result must first 
be removed from the rule set. Each rule lying below the deleted rule is 
then popped up one position in the rule list. Next, templates and context 
rules are added just as in the previous case. The operation is summarized 
in Figure 8. 
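
The three possible outcomes can be sketched as a single update step. This is a simplified illustration: one rule list stands in for the separate template and context rule sets, each rule is reduced to a (pattern, interpretation) pair, and the pattern strings and function names are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Rule = Tuple[str, int]  # (pattern, interpretation), an assumed representation


def analyze(rules: List[Rule], features: List[str]) -> Optional[int]:
    """Top-down search: return the index of the first matching rule, if any."""
    for index, (pattern, _interp) in enumerate(rules):
        if pattern in features:
            return index
    return None


def learn_step_v1(rules: List[Rule], features: List[str], correct: int) -> str:
    """Process one input and update the rule list in place."""
    index = analyze(rules, features)
    if index is not None and rules[index][1] == correct:
        return "correct"               # outcome 1: nothing to change
    outcome = "unanalyzed"             # outcome 2: no rule matched
    if index is not None:
        outcome = "incorrect"          # outcome 3: a rule gave the wrong answer
        del rules[index]               # rules below move up one position
    for pattern in features:           # add non-duplicate rules at the bottom
        if (pattern, correct) not in rules:
            rules.append((pattern, correct))
    return outcome


rules: List[Rule] = []
first = learn_step_v1(rules, ["-1:large", "-2:in"], correct=1)
second = learn_step_v1(rules, ["-1:large", "+1:of"], correct=2)
```

After the second input the surviving rule from the first input has moved to the top of the list, and the new rules sit below it.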

Critical ordering of rules, as is done in section 4, is not possible 
when rules are created automatically. However, the process of deleting a 
rule and popping up those below it and then adding the new rules at the 
bottom tends to make the better rules, that is, those which do not get deleted, 
filter to the top. While this method may not be as effective as critical 
ordering by hand, it does tend to concentrate the better rules near the top 
of the lists. The top-down search strategy thus matches rules against an 
input in roughly best-first order. Experimental results which verify this 
are presented later. 

Ideally, a system such as that described above operating on a corpus 
of the form shown in Figure 7 should generate the following type of results. 
The first few inputs are of course unanalyzed due to the lack of information. 
As more inputs are read, the overall system performance should begin a steady 
improvement. Eventually the system should stabilize with a fixed rule set 
and near perfect disambiguation. From this point the system should never 
unlearn. That is, it should never err on an input that it previously 
analyzed correctly. Likewise it should not be overly sensitive to the order 
in which inputs are introduced. Actual experimental results obtained compare 
quite favorably with this idealized behavior. These results are presented 
in subsection E. 

Summary of Learning Process 
Figure 8 

D) Spurious Rules 

The learning process presented in subsection C has a few inherent 
problems. These center mainly around the treatment of spurious rules. A 
spurious rule is defined to be a template or context rule which does not 
discriminate between interpretations of an ambiguous word. As an example, 
assume that templates and context rules for disambiguation of TYPE are made 
from input A in Figure 9. One of the templates extracted from this input 
is LARGE TYPE. This however is of no value, as can be seen from input B. 
Thus LARGE TYPE is considered a spurious rule. 



Input A: The book is printed in large type. 
         (interpretation 1, "printing") 

Input B: A tiger is a large type of cat. 
         (interpretation 2, "kind or variety") 

Example of a Spurious Rule 
Figure 9 



The difficulty with the process as presented in subsection C (to be 
called version 1 in the remainder of this study) can be visualized by the 
following example. Assume rules are learned from input A in Figure 9. 
Included among these is the spurious rule LARGE TYPE which is associated 
with interpretation 1. Assume also that input B is then processed by a 
match with LARGE TYPE and hence incorrectly associated with interpretation 
1. Version 1 then deletes the interpretation 1 template and substitutes 
one which is identical except for its association with interpretation 2. 
Thus a spurious rule is deleted but replaced with one equally spurious. 
This actually produces a slight improvement since the new rule is inserted 
at the bottom of the list and thus is less likely to be matched than the one 
it replaces. But the spurious rules remain and can cause further errors. 
They may even cause a thrashing back and forth between interpretations and 
thus prevent stability. 

One possible solution to this is implemented in version 2. Whenever 
a rule is to be deleted because it causes an incorrect analysis, the set 
of new incoming rules is checked for an occurrence of this same rule. If 
found, the matched rule is not added to the permanent rule set. Thus using 
version 2, the incorrect analysis of input B would not only remove LARGE 
TYPE from the template set but would also prevent this same template (with 
a different interpretation) from entering the set at that time. In the 
short run this has the effect of eliminating spurious rules from the system. 
But since no record is kept, these same spurious rules may reenter the system 
the next time they occur. A reoccurrence of input A following input B, 
for example, would put LARGE TYPE back on the rule list. Thus while version 
2 does provide some advantages over version 1, there is still room for 
improvement. 

The second modification, version 3, solves the difficulty inherent 
in version 2. When spurious rules are located, they are removed from 
both the rule set and the new entering set as in version 2. But in addition 
the rule is recorded on a list of undesirable rules. All incoming rules 
are checked against the undesirable list. If a match is found, that incoming 
rule is deleted. In this way a spurious rule, once found, is permanently 
prevented from reentering the system. While this process may cause a mild 
retardation in the learning rate due to the decreased number of rules used, 
the slowdown is more than compensated by the increased accuracy of the results. 
The workings of versions 1, 2, and 3 are summarized in Figure 10. 
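
The difference between the three versions can be sketched with a single update function. As an assumption for illustration, a rule is a (pattern, interpretation) pair in one list, and the undesirable list stores patterns without an interpretation; all names are hypothetical. The `run` helper feeds the sequence A, B, A from the LARGE TYPE example to each version.

```python
from typing import List, Optional, Set, Tuple

Rule = Tuple[str, int]  # (pattern, interpretation), an assumed representation


def first_match(rules: List[Rule], features: List[str]) -> Optional[int]:
    """Top-down search for the first rule whose pattern occurs in the input."""
    for index, (pattern, _interp) in enumerate(rules):
        if pattern in features:
            return index
    return None


def learn_step(rules: List[Rule], undesirable: Set[str],
               features: List[str], correct: int, version: int) -> str:
    index = first_match(rules, features)
    if index is not None and rules[index][1] == correct:
        return "correct"
    blocked: Set[str] = set()          # patterns barred from entering this time
    outcome = "unanalyzed"
    if index is not None:
        outcome = "incorrect"
        bad_pattern = rules[index][0]
        del rules[index]               # all versions delete the offending rule
        if version >= 2:
            blocked.add(bad_pattern)   # version 2: do not re-add it now
        if version == 3:
            undesirable.add(bad_pattern)   # version 3: remember it permanently
    if version == 3:
        blocked |= undesirable
    for pattern in features:
        if pattern not in blocked and (pattern, correct) not in rules:
            rules.append((pattern, correct))
    return outcome


def run(version: int) -> Tuple[List[Rule], Set[str]]:
    rules: List[Rule] = []
    undesirable: Set[str] = set()
    for features, correct in [(["large type"], 1), (["large type"], 2), (["large type"], 1)]:
        learn_step(rules, undesirable, features, correct, version)
    return rules, undesirable
```

Under these assumptions versions 1 and 2 both end with LARGE TYPE back in the rule set, while version 3 ends with an empty rule set and LARGE TYPE on the undesirable list.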

E) Experiments and Results 

The experimentation consists of processing each of the three corpora 
with the three system versions, a total of nine runs in all. The corpora 
are each 180 sentences in length and are described previously in subsection 
B. The performance measures that are taken are shown in Figure 11. These 
results are tabulated in Figure 12. Figure 13 shows the resolution recall 
and precision for each word calculated at ten document intervals. Averages 
for the results in Figure 13 are presented in Figure 14. These results 
show how the overall system performance improves as more inputs are seen, thus 
indicating a true learning process. These charts also show the general 
superiority of version 3 over the other two. To indicate this fact more 
clearly, Figure 15 shows the difference in resolution recall and precision 
for the three versions averaged over all corpora. Version 1 is taken as the 
standard and lies on the x axis. Displacement above or below the x axis 
represents superiority or inferiority relative to version 1. These graphs 
show that version 2 and especially version 3 improve both resolution recall 
and precision over version 1. That is, not only do they perform more correct 
analyses than version 1, they also perform fewer incorrect analyses. Usually 
this results in an increased number of unanalyzed inputs. This is actually 
a very desirable result since, if a choice must be made between an input 
being analyzed incorrectly or not analyzed at all, the latter seems 
preferable. An example of this can be seen in Figure 13B. Version 2 produces 
only a few more correct analyses than does version 1, and thus the recall 
results show very little difference. However, version 2 produces many fewer 
incorrect analyses, thus significantly improving the precision results. 

                           STATUS AFTER INPUT 

INPUT   V-1 Rule Set*      V-2 Rule Set*     V-3 Rule Set      V-3 Undesirable 
                                                               Rule List 

A       LARGE TYPE (1)**   LARGE TYPE (1)    LARGE TYPE (1)    - 
B       LARGE TYPE (2)**   -                 -                 LARGE TYPE 
A       LARGE TYPE (1)     LARGE TYPE (1)    -                 LARGE TYPE 

*  This chart shows only the part of the rule set that is 
   relevant to this discussion. 

** Numbers in parentheses indicate the interpretation 
   associated with the rule. 

   Interpretation 1 is printing 
   Interpretation 2 is kind or variety 

Summary of Versions 1, 2, and 3 
Figure 10 

T    The total number of ambiguities in the data set 
C    The number of correctly resolved ambiguities 
I    The number of incorrectly resolved ambiguities 
U    The number of unresolved ambiguities 
RR   Resolution Recall = C/T 
RP   Resolution Precision = C/(C+I) 

Performance Measures 
Figure 11 

Figure 13A 
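
The measures of Figure 11 can be computed as in the following sketch. It assumes that every ambiguity is counted exactly once as correct, incorrect, or unresolved, so that T = C + I + U; the function name and the sample counts are illustrative and are not results from the experiments.

```python
from typing import Tuple


def resolution_measures(correct: int, incorrect: int, unresolved: int) -> Tuple[float, float]:
    """Return (resolution recall, resolution precision) as defined in Figure 11."""
    total = correct + incorrect + unresolved           # T = C + I + U (assumed)
    attempted = correct + incorrect
    recall = correct / total if total else 0.0         # RR = C / T
    precision = correct / attempted if attempted else 0.0   # RP = C / (C + I)
    return recall, precision


# Illustrative counts only: turning two wrong answers into unanalyzed inputs
# leaves recall unchanged but raises precision.
before = resolution_measures(correct=6, incorrect=2, unresolved=2)
after = resolution_measures(correct=6, incorrect=0, unresolved=4)
```

This is the effect described in the text: a version that declines to answer rather than answer wrongly gains precision without losing recall.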

The results shown so far are prejudiced downward by the inclusion 
of the start-up portion of the learning process which necessarily performs 
poorly. Therefore a more important measure of system performance is a 
moving average. Figure 16 shows for each word the number of disambiguations 
performed correctly, incorrectly, and unanalyzed for each ten sentence 
group. These charts clearly indicate the anticipated poor start, the 
gradual improvement, and the final stabilization at near perfect performance. 
A 10 in the "Correct" column represents perfect resolution for that sentence 
group. These statistics are summarized by Figure 17. And in Figure 18, these 
averages are shown graphically. The x axis is the interval number. Interval 
5, for example, contains inputs 41-50, etc. The y axis represents the 
number of correct analyses out of a possible 10. These charts are very 
graphic proof that the learning process builds and stabilizes at a high 
performance level. 
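
The per-interval tabulation can be sketched as follows, assuming each input's outcome has been recorded as one of the strings "correct", "incorrect", or "unanalyzed". The labels and function name are assumptions, and the sample outcomes are illustrative only.

```python
from collections import Counter
from typing import List, Tuple


def interval_counts(outcomes: List[str], size: int = 10) -> List[Tuple[int, int, int]]:
    """Count (correct, incorrect, unanalyzed) for each consecutive group of inputs."""
    rows = []
    for start in range(0, len(outcomes), size):
        counts = Counter(outcomes[start:start + size])
        rows.append((counts["correct"], counts["incorrect"], counts["unanalyzed"]))
    return rows


# Illustrative outcomes: a poor first interval followed by a perfect one.
sample = ["unanalyzed"] * 4 + ["incorrect"] * 2 + ["correct"] * 4 + ["correct"] * 10
table = interval_counts(sample)
```

With zero-based indexing, the group starting at index 40 holds inputs 41-50, matching interval 5 in the text.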

Several other statistics are worthy of note. Figure 19 shows for 
each run the number of spurious templates and context rules contained in 
the rule sets at the end of that run. This number is broken down to show how 
many of these spurious rules occur in the first, middle, and last third of 
their respective rule sets. These figures indicate first that most rules 
learned by the system are not spurious; and, secondly, that spurious rules 
tend to be densest at the bottom of the rule sets. Thus due to the top-down 
search strategy, correct rules are far more likely to be chosen than spurious 
ones. 
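
The breakdown by thirds can be sketched as below. It assumes the spurious rules have already been identified by hand and are given as zero-based positions in a rule list of known length; the function name and sample values are illustrative.

```python
from typing import Iterable, List


def spurious_by_third(n_rules: int, spurious_positions: Iterable[int]) -> List[int]:
    """Count spurious rules in the first, middle, and last third of a rule list."""
    counts = [0, 0, 0]
    for position in spurious_positions:
        third = min(3 * position // n_rules, 2)   # 0 = top third, 2 = bottom third
        counts[third] += 1
    return counts


# Illustrative positions only, for a list of nine rules.
distribution = spurious_by_third(9, [0, 4, 7, 8])
```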

As stated previously, one requirement for a good learning system is 
that it not be prone to unlearning. An input is considered to be unlearned 
if it is seen once and analyzed correctly and subsequently seen again and 
analyzed incorrectly. Figure 20 shows the number of unlearned inputs for 
each of the nine experimental runs. The low values here clearly indicate 
that once the system has learned to disambiguate a particular input, that 
capability remains learned. Also, the fact that versions 2 and 3 perform 
better than version 1 with respect to unlearning indicates that the 
prevention of spurious rules is an aid in the prevention of unlearning. 
Unlearning may stem from sources other than the system itself. If a user 
provides incorrect information to a learning system, improper rules and 
subsequent unlearning may result. In an operational learning system it may 
therefore be necessary for an analyst to review periodically the newly 
learned rules prior to their final acceptance into the permanent rule set. 
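
The definition of an unlearned input given above can be sketched as a count over a run's history. The history format, a list of (sentence identifier, outcome) pairs in processing order, is an assumption made for the example, as are the sample values.

```python
from typing import Hashable, Iterable, Set, Tuple


def count_unlearned(history: Iterable[Tuple[Hashable, str]]) -> int:
    """Count inputs analyzed correctly once and later analyzed incorrectly."""
    learned: Set[Hashable] = set()
    unlearned: Set[Hashable] = set()
    for sentence_id, outcome in history:
        if outcome == "correct":
            learned.add(sentence_id)
        elif outcome == "incorrect" and sentence_id in learned:
            unlearned.add(sentence_id)     # each input is counted at most once
    return len(unlearned)


# Illustrative history: sentence 1 is unlearned, sentence 2 is not.
history = [(1, "correct"), (2, "unanalyzed"), (1, "incorrect"),
           (2, "correct"), (1, "correct"), (2, "correct")]
unlearned_total = count_unlearned(history)
```

An input that goes from correct to unanalyzed is not counted here, since the definition in the text refers only to a later incorrect analysis.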

One final investigation is to look at the contents of the undesirable 
rule lists following each version 3 run. Figure 21 shows the total number of 
rules in the lists and the number which by hand analysis are found to be 
actually spurious. Ideally all rules in these lists should be spurious; and 
the figures shown are quite close to this ideal. These results show that the 
system is able to learn not only the rules which make good disambiguators, 
but also those which are not useful. The results presented here show these 
processes are truly capable of learning to disambiguate with a high degree 
of success. 

F) Extensions 

There are numerous other applications for a learning technique such 
as the one presented previously. A large system with many users may be able 
to learn the individual needs and techniques of its users. The system 
could thus tailor a specialized subsystem to each individual. In the area 
of information retrieval a system might be able to learn to modify techniques 
and parameters in order to improve relevance feedback performance for a 
particular collection and user. In nearly any application where a set of 
rules or parameters must be created in order to perform some form of 
analysis, the learning technique is potentially valuable, especially where 
many such sets must be created to meet the needs of many users. 

The learning process can also be applied to natural language 
analysis in the resolution of pronouns. Unlike ambiguities which have 
multiple meanings, pronouns have no meaning in isolation. To determine 
meaning, the word to which the pronoun refers must be located. This could 
be accomplished in the following way. The learning process looks at each 
noun in the vicinity of the pronoun and learns their contexts. These are 
then compared with the context of the pronoun and the noun with the best 
match used. There are of course some problems to be solved. For example, 
not all pronouns refer to a specific thing. The fact that some pronouns 
encompass large concepts or merely provide an impersonal subject can be 
seen in the second and third example sentences below. 

A. Take an egg and break it into a bowl. 
   (specific reference) 

B. The consequence of this is that the project is feasible. 
   (multiple reference) 

C. These results show that it is possible. 
   (impersonal) 

provide a more accurate natural language analysis process and 
improve performance in any natural language application. 

6. Conclusion 

This study is intended first to demonstrate the importance of 
disambiguation in various forms of natural language analysis, and to motivate 
investigation into the automation of this process. It also serves as a 
test of the template analysis facility. The study shows that it is possible 
to perform this disambiguation with a high degree of accuracy using an 
extended form of template analysis and a predetermined set of structured 
templates and unstructured context rules. The creation of these rules 
requires an analyst to examine typical inputs and determine the words or 
structures which indicate the intended meaning of the ambiguous word. As 
is shown in section 5, this manual process may be eliminated by implementation 
of a process which allows the system to disambiguate for itself. With the 
exception of the first few inputs, for which the performance is understandably 
low, the learning process demonstrates the same high degree of accuracy 
achieved with the hand made disambiguation rules. Not only is the system 
able to learn which rules provide good disambiguation, it can also 
determine which rules do not, and exclude these rules from the system. The 
learning process has applications in many areas and template analysis 
appears sufficiently general to facilitate many of the applications. 